Quantization for Inference

INT8, FP8, GPTQ, AWQ — shrinking model weights for faster and cheaper inference.

Tags: quantization, inference, int8, llm

Why Quantize?

Large language models are dominated by matrix multiplications over billions of parameters stored as FP16/BF16. Quantization reduces precision — typically to INT8 or INT4 — to cut memory footprint and increase throughput.

| Precision    | Bits | Memory vs FP16 | Typical accuracy loss |
|--------------|------|----------------|-----------------------|
| FP16 / BF16  | 16   | 1×             | baseline              |
| INT8         | 8    | 0.5×           | < 1%                  |
| INT4 (GPTQ)  | 4    | 0.25×          | 1–3%                  |
| FP8 (E4M3)   | 8    | 0.5×           | < 0.5%                |
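The memory column is easy to check with a back-of-the-envelope calculation; the sketch below uses a 7B-parameter model as an illustrative (assumed) size:

```python
# Approximate weight memory for a 7B-parameter model at each precision.
# 7e9 is an illustrative model size, not one mentioned above.
params = 7e9

for name, bits in [("FP16/BF16", 16), ("INT8", 8), ("INT4", 4)]:
    gib = params * bits / 8 / 2**30  # bits -> bytes -> GiB
    print(f"{name:>10}: {gib:.1f} GiB")
```

At FP16 that is roughly 13 GiB of weights alone, which is why INT4 is what makes 70B-class models fit on a single consumer GPU.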

Post-Training Quantization (PTQ)

PTQ quantizes a pre-trained model using a small calibration dataset. The key idea: learn per-channel or per-group scale factors that minimize quantization error.

# Symmetric per-channel INT8 quantization (one scale per output channel)
import torch

scale = x.abs().max(dim=-1).values / 127      # per-row max maps to the INT8 range
scale = scale.clamp(min=1e-8)                 # guard against all-zero channels
x_int8 = torch.round(x / scale.unsqueeze(-1)).clamp(-128, 127).to(torch.int8)
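A round-trip makes the quantization error tangible. This self-contained sketch (the tensor shape and seed are arbitrary choices for illustration) quantizes a random weight matrix per channel, dequantizes, and reports the relative error:

```python
import torch

torch.manual_seed(0)
w = torch.randn(4, 256)  # arbitrary weight matrix for illustration

# Symmetric per-channel (per-row) INT8 quantization.
scale = w.abs().max(dim=-1).values / 127
w_int8 = torch.round(w / scale.unsqueeze(-1)).clamp(-128, 127).to(torch.int8)

# Dequantize and measure the relative reconstruction error.
w_deq = w_int8.float() * scale.unsqueeze(-1)
rel_err = (w - w_deq).norm() / w.norm()
print(f"relative error: {rel_err:.4f}")  # typically well under 1% at 8 bits
```

This is the baseline that the methods below improve on: plain rounding is already good at 8 bits, and their job is to keep the error this small at 4 bits.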
  • GPTQ — layer-wise weight quantization using approximate second-order information (Hessian).
  • AWQ — activation-aware: protects salient weight channels that matter most to activations.
  • SmoothQuant — migrates quantization difficulty from activations to weights via per-channel scaling.
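The SmoothQuant idea in particular is easy to sketch: divide each activation channel by a per-channel factor and multiply the matching weight row by the same factor, which flattens activation outliers while leaving the matmul mathematically unchanged. Below, the shapes, seed, and alpha=0.5 are illustrative assumptions:

```python
import torch

torch.manual_seed(0)
x = torch.randn(32, 64) * torch.logspace(-1, 1, 64)  # activations with outlier channels
w = torch.randn(64, 128)                             # weights

# Per-channel smoothing factor: s_j = max|x_j|^alpha / max|w_j|^(1-alpha)
alpha = 0.5
s = x.abs().amax(dim=0).pow(alpha) / w.abs().amax(dim=1).pow(1 - alpha)

# Migrate quantization difficulty: activations get flatter, weights absorb s.
x_smooth = x / s
w_smooth = w * s.unsqueeze(-1)

# The product is unchanged, so the model's output is preserved exactly.
assert torch.allclose(x @ w, x_smooth @ w_smooth, atol=1e-3)
```

After smoothing, both `x_smooth` and `w_smooth` are friendlier to the symmetric per-channel scheme shown above.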

Hardware Support

INT8 tensor cores on Ampere/Hopper deliver ~2× the throughput of FP16 tensor cores. FP8 tensor cores on Hopper push that further. The hardware is there — the software stack (TensorRT-LLM, vLLM, llama.cpp) is catching up.

Bottom line: INT8 is essentially free accuracy-wise for most models. INT4 requires careful method selection but enables running 70B+ models on consumer GPUs.