Quantization for Inference
INT8, FP8, GPTQ, AWQ — shrinking model weights for faster and cheaper inference.
Why Quantize?
Large language models are dominated by matrix multiplications over billions of parameters stored as FP16/BF16. Quantization reduces precision — typically to INT8 or INT4 — to cut memory footprint and increase throughput.
| Precision | Bits | Memory vs FP16 | Typical Accuracy Loss |
|---|---|---|---|
| FP16 / BF16 | 16 | 1× | baseline |
| INT8 | 8 | 0.5× | < 1 % |
| INT4 (GPTQ) | 4 | 0.25× | 1–3 % |
| FP8 (E4M3) | 8 | 0.5× | < 0.5 % |
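The memory column in the table is simple arithmetic: bytes = parameters × bits / 8. A quick back-of-envelope check for a hypothetical 70B-parameter model (weights only, ignoring KV cache and activations):

```python
# Weight memory at different precisions for an illustrative 70B-parameter model
params = 70e9
for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    gib = params * bits / 8 / 2**30   # bytes -> GiB
    print(f"{name}: {gib:.0f} GiB")
```

At FP16 the weights alone (~130 GiB) exceed any single consumer GPU; at INT4 (~33 GiB) they fit on a pair of 24 GB cards, which is why INT4 matters for local inference.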
Post-Training Quantization (PTQ)
PTQ quantizes a pre-trained model using a small calibration dataset, with no retraining. The key idea: choose per-channel or per-group scale factors that minimize quantization error, computed from calibration statistics rather than learned by backpropagation.
```python
# Symmetric per-channel quantization: one scale per output channel
import torch

x = torch.randn(64, 128)                    # weight matrix [out_channels, in_channels]
scale = x.abs().amax(dim=-1) / 127          # per-channel scale factor
x_int8 = torch.round(x / scale.unsqueeze(-1)).clamp(-127, 127).to(torch.int8)
```
Popular Methods
- GPTQ — layer-wise weight quantization using approximate second-order information (Hessian).
- AWQ — activation-aware: protects salient weight channels that matter most to activations.
- SmoothQuant — migrates quantization difficulty from activations to weights via per-channel scaling.
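SmoothQuant's per-channel scaling can be sketched in a few lines. The identity Y = X·W = (X/s)·(diag(s)·W) is exact, but dividing activations by a well-chosen s flattens their outlier-heavy range so they quantize with less error. The channel maxima below are made-up numbers; alpha=0.5 is the migration strength from the paper.

```python
# SmoothQuant-style scale migration (sketch with hypothetical statistics)
alpha = 0.5
act_absmax = [20.0, 0.5, 8.0]   # per-channel activation maxima (channel 0 is an outlier)
w_absmax   = [0.4, 0.6, 0.5]    # per-channel weight maxima
# s_j = |X_j|max^alpha / |W_j|max^(1-alpha): big activation outliers get big scales,
# shifting their dynamic range into the (easier to quantize) weights.
s = [(a ** alpha) / (w ** (1 - alpha)) for a, w in zip(act_absmax, w_absmax)]
print([round(v, 2) for v in s])
```

The outlier channel receives the largest scale, so after dividing by s the activation channels have comparable magnitudes and a single activation scale factor wastes far fewer quantization levels.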
Hardware Support
INT8 tensor cores on Ampere/Hopper deliver ~2× the throughput of FP16 tensor cores. FP8 tensor cores on Hopper push that further. The hardware is there — the software stack (TensorRT-LLM, vLLM, llama.cpp) is catching up.
Bottom line: INT8 is essentially free accuracy-wise for most models. INT4 requires careful method selection but enables running 70B+ models on consumer GPUs.
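The "INT8 is free, INT4 needs care" claim is easy to see empirically. A minimal round-trip experiment on synthetic Gaussian weights (using a single tensor-wide scale for simplicity, where real INT4 methods like GPTQ and AWQ use per-group scales precisely to shrink this gap):

```python
# Mean round-trip quantization error, 8-bit vs 4-bit, on synthetic weights
import random

random.seed(0)
w = [random.gauss(0, 1) for _ in range(10_000)]
errors = {}
for bits in (8, 4):
    qmax = 2 ** (bits - 1) - 1                 # 127 for INT8, 7 for INT4
    scale = max(abs(v) for v in w) / qmax      # one tensor-wide scale (naive)
    errors[bits] = sum(abs(v - round(v / scale) * scale) for v in w) / len(w)
    print(f"INT{bits}: mean abs round-trip error {errors[bits]:.4f}")
```

With only 15 usable levels, naive INT4 has roughly an order of magnitude more error than INT8, which is exactly the gap that second-order (GPTQ) and activation-aware (AWQ) methods are designed to close.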