Standard
Generative Pre-trained Transformer Quantization (GPTQ) is an efficient one-shot weight quantization method designed to reduce the memory and computational requirements of large language models (LLMs) such as GPT and OPT. By leveraging approximate second-order information, GPTQ compresses models with hundreds of billions of parameters down to 2–4 bits per weight with minimal accuracy loss, allowing such models to run on a single GPU and greatly improving accessibility and inference speed. GPTQ more than doubles the compression achieved by previous one-shot quantization methods while preserving accuracy, making it a key technique for deploying large generative models efficiently, particularly in resource-constrained or edge environments.
| Field | Value |
| --- | --- |
| Identifier | GPTQ |
| Publisher | IST Austria Distributed Algorithms and Systems Lab |
| Author | Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh |
| URL | https://github.com/IST-DASLab/gptq |
| Title | Generative Pre-trained Transformer Quantization |
| Version | |
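
At its core, GPTQ quantizes the weights of each linear layer one column at a time and uses inverse-Hessian information from a small calibration set to redistribute each column's quantization error onto the columns that have not yet been quantized. The sketch below is a minimal, unblocked NumPy illustration of that idea, not the repository's implementation: the function name `gptq_quantize` is hypothetical, a simple symmetric per-row round-to-nearest grid is assumed, and the blocked "lazy batch" updates, weight grouping, and GPU kernels of the actual code are omitted.

```python
import numpy as np

def gptq_quantize(W, X, bits=4, damp=0.01):
    """Simplified sketch of GPTQ-style one-shot quantization of one linear layer.

    W: (out_features, in_features) weight matrix.
    X: (in_features, n_samples) calibration inputs seen by this layer.
    Returns dequantized weights after column-wise error compensation.
    """
    W = np.asarray(W, dtype=np.float64).copy()
    rows, cols = W.shape

    # Layer-wise Hessian of the squared reconstruction error, H = 2 * X X^T,
    # with a small dampening term added to the diagonal for numerical stability.
    H = 2.0 * X @ X.T
    H[np.diag_indices(cols)] += damp * np.mean(np.diag(H))

    # Upper-triangular Cholesky factor of the inverse Hessian; its rows stand in
    # for the iteratively updated inverse-Hessian rows used by GPTQ.
    Hinv = np.linalg.cholesky(np.linalg.inv(H)).T

    # Simple symmetric per-row quantization grid (assumption; real code also
    # supports grouped and asymmetric grids).
    qmax = 2 ** (bits - 1) - 1
    scale = np.maximum(np.abs(W).max(axis=1), 1e-12) / qmax

    Q = np.zeros_like(W)
    for j in range(cols):
        w = W[:, j]
        # Round-to-nearest onto the integer grid, then dequantize.
        q = np.clip(np.round(w / scale), -qmax - 1, qmax) * scale
        Q[:, j] = q
        # OBS-style update: spread this column's quantization error over the
        # remaining columns, weighted by the inverse-Hessian information.
        err = (w - q) / Hinv[j, j]
        W[:, j + 1:] -= np.outer(err, Hinv[j, j + 1:])
    return Q

# Tiny illustrative run on random data.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))
X = rng.standard_normal((16, 128))
Q = gptq_quantize(W, X, bits=4)
print(np.mean((W @ X - Q @ X) ** 2))  # reconstruction error on the calibration inputs
```

The column-wise error compensation is what distinguishes this from plain round-to-nearest quantization; the repository linked above applies the same idea per transformer layer with blocked updates so that it scales to models with hundreds of billions of parameters.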