Quantization for Inference
INT8, FP8, GPTQ, AWQ — shrinking model weights for faster and cheaper inference.
Why Quantize?
Large language models are dominated by matrix multiplications over billions of parameters stored as FP16/BF16. Quantization reduces precision — typically to INT8 or INT4 — to cut memory footprint and increase throughput.
| Precision | Bits | Memory vs FP16 | Typical Accuracy Loss |
|---|---|---|---|
| FP16 / BF16 | 16 | 1× | baseline |
| INT8 | 8 | 0.5× | < 1 % |
| INT4 (GPTQ) | 4 | 0.25× | 1–3 % |
| FP8 (E4M3) | 8 | 0.5× | < 0.5 % |
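The memory column in the table is simple arithmetic: bytes = parameters × bits / 8. A quick back-of-envelope check for a hypothetical 70B-parameter model (weights only, ignoring KV cache and activations):

```python
# Weight memory at different precisions for an illustrative 70B-parameter model
params = 70e9
for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    gib = params * bits / 8 / 2**30   # bytes -> GiB
    print(f"{name}: {gib:.0f} GiB")
```

At FP16 the weights alone (~130 GiB) exceed any single consumer GPU; at INT4 (~33 GiB) they fit on a pair of 24 GB cards, which is why INT4 matters for local inference.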
Post-Training Quantization (PTQ)
PTQ quantizes a pre-trained model using a small calibration dataset, with no retraining. The key idea: choose per-channel or per-group scale factors that minimize quantization error, computed from calibration statistics rather than learned by backpropagation.
```python
# Symmetric per-channel quantization: one scale per output channel
import torch

x = torch.randn(64, 128)                    # weight matrix [out_channels, in_channels]
scale = x.abs().amax(dim=-1) / 127          # per-channel scale factor
x_int8 = torch.round(x / scale.unsqueeze(-1)).clamp(-127, 127).to(torch.int8)
```
Popular Methods
- GPTQ — layer-wise weight quantization using approximate second-order information (Hessian).
- AWQ — activation-aware: protects salient weight channels that matter most to activations.
- SmoothQuant — migrates quantization difficulty from activations to weights via per-channel scaling.
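SmoothQuant's per-channel scaling can be sketched in a few lines. The identity Y = X·W = (X/s)·(diag(s)·W) is exact, but dividing activations by a well-chosen s flattens their outlier-heavy range so they quantize with less error. The channel maxima below are made-up numbers; alpha=0.5 is the migration strength from the paper.

```python
# SmoothQuant-style scale migration (sketch with hypothetical statistics)
alpha = 0.5
act_absmax = [20.0, 0.5, 8.0]   # per-channel activation maxima (channel 0 is an outlier)
w_absmax   = [0.4, 0.6, 0.5]    # per-channel weight maxima
# s_j = |X_j|max^alpha / |W_j|max^(1-alpha): big activation outliers get big scales,
# shifting their dynamic range into the (easier to quantize) weights.
s = [(a ** alpha) / (w ** (1 - alpha)) for a, w in zip(act_absmax, w_absmax)]
print([round(v, 2) for v in s])
```

The outlier channel receives the largest scale, so after dividing by s the activation channels have comparable magnitudes and a single activation scale factor wastes far fewer quantization levels.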
Hardware Support
INT8 tensor cores on Ampere/Hopper deliver ~2× the throughput of FP16 tensor cores. FP8 tensor cores on Hopper push that further. The hardware is there — the software stack (TensorRT-LLM, vLLM, llama.cpp) is catching up.
Bottom line: INT8 is essentially free accuracy-wise for most models. INT4 requires careful method selection but enables running 70B+ models on consumer GPUs.
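The "INT8 is free, INT4 needs care" claim is easy to see empirically. A minimal round-trip experiment on synthetic Gaussian weights (using a single tensor-wide scale for simplicity, where real INT4 methods like GPTQ and AWQ use per-group scales precisely to shrink this gap):

```python
# Mean round-trip quantization error, 8-bit vs 4-bit, on synthetic weights
import random

random.seed(0)
w = [random.gauss(0, 1) for _ in range(10_000)]
errors = {}
for bits in (8, 4):
    qmax = 2 ** (bits - 1) - 1                 # 127 for INT8, 7 for INT4
    scale = max(abs(v) for v in w) / qmax      # one tensor-wide scale (naive)
    errors[bits] = sum(abs(v - round(v / scale) * scale) for v in w) / len(w)
    print(f"INT{bits}: mean abs round-trip error {errors[bits]:.4f}")
```

With only 15 usable levels, naive INT4 has roughly an order of magnitude more error than INT8, which is exactly the gap that second-order (GPTQ) and activation-aware (AWQ) methods are designed to close.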