Standard
Generative Pre-trained Transformer Quantization (GPTQ) is an efficient one-shot weight quantization method designed to reduce the memory and computational requirements of large language models (LLMs) such as GPT and OPT. By leveraging approximate second-order information, GPTQ compresses models with hundreds of billions of parameters down to 2–4 bits per weight with minimal accuracy loss, allowing such models to run on a single GPU and greatly improving accessibility and inference speed. GPTQ more than doubles the compression achieved by previous one-shot quantization methods while preserving accuracy, making it a key technique for deploying large generative models efficiently, particularly in resource-constrained or edge environments.
| Field | Value |
| --- | --- |
| Identifier | GPTQ |
| Publisher | IST Austria Distributed Algorithms and Systems Lab |
| Author | Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh |
| URL | https://github.com/IST-DASLab/gptq |
| Title | Generative Pre-trained Transformer Quantization |
| Version | |
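
At its core, GPTQ quantizes the weights of each linear layer one column at a time and uses inverse-Hessian information from a small calibration set to redistribute each column's quantization error onto the columns that have not yet been quantized. The sketch below is a minimal, unblocked NumPy illustration of that idea, not the repository's implementation: the function name `gptq_quantize` is hypothetical, a simple symmetric per-row round-to-nearest grid is assumed, and the blocked "lazy batch" updates, weight grouping, and GPU kernels of the actual code are omitted.

```python
import numpy as np

def gptq_quantize(W, X, bits=4, damp=0.01):
    """Simplified sketch of GPTQ-style one-shot quantization of one linear layer.

    W: (out_features, in_features) weight matrix.
    X: (in_features, n_samples) calibration inputs seen by this layer.
    Returns dequantized weights after column-wise error compensation.
    """
    W = np.asarray(W, dtype=np.float64).copy()
    rows, cols = W.shape

    # Layer-wise Hessian of the squared reconstruction error, H = 2 * X X^T,
    # with a small dampening term added to the diagonal for numerical stability.
    H = 2.0 * X @ X.T
    H[np.diag_indices(cols)] += damp * np.mean(np.diag(H))

    # Upper-triangular Cholesky factor of the inverse Hessian; its rows stand in
    # for the iteratively updated inverse-Hessian rows used by GPTQ.
    Hinv = np.linalg.cholesky(np.linalg.inv(H)).T

    # Simple symmetric per-row quantization grid (assumption; real code also
    # supports grouped and asymmetric grids).
    qmax = 2 ** (bits - 1) - 1
    scale = np.maximum(np.abs(W).max(axis=1), 1e-12) / qmax

    Q = np.zeros_like(W)
    for j in range(cols):
        w = W[:, j]
        # Round-to-nearest onto the integer grid, then dequantize.
        q = np.clip(np.round(w / scale), -qmax - 1, qmax) * scale
        Q[:, j] = q
        # OBS-style update: spread this column's quantization error over the
        # remaining columns, weighted by the inverse-Hessian information.
        err = (w - q) / Hinv[j, j]
        W[:, j + 1:] -= np.outer(err, Hinv[j, j + 1:])
    return Q

# Tiny illustrative run on random data.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))
X = rng.standard_normal((16, 128))
Q = gptq_quantize(W, X, bits=4)
print(np.mean((W @ X - Q @ X) ** 2))  # reconstruction error on the calibration inputs
```

The column-wise error compensation is what distinguishes this from plain round-to-nearest quantization; the repository linked above applies the same idea per transformer layer with blocked updates so that it scales to models with hundreds of billions of parameters.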