Quantization

In Short

Quantization shrinks AI model weights from high-precision floating-point numbers down to lower-precision values, usually integers, dramatically cutting memory use and speeding up inference. The tradeoff is a small loss in output quality that grows as precision drops. GGUF (a file format) and GPTQ and AWQ (quantization methods) are the three dominant options for deploying quantized models in 2026.

01. What It Is

Every weight in a neural network is a number. By default, modern LLMs store those numbers in BF16 format, which uses 2 bytes per parameter. A 70-billion-parameter model therefore needs roughly 140 GB of memory just to load. Quantization replaces each weight with a lower-precision approximation: INT8 uses 1 byte, INT4 uses half a byte. The same 70B model drops to about 35-40 GB at 4-bit precision, which fits on hardware that could not previously load it at all.
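
As a rough check on those figures, the arithmetic can be sketched in a few lines of Python. The function name and the 1 GB = 10^9 bytes convention are assumptions made for illustration. The estimate covers weights only and ignores scale metadata, KV cache, and activations, which is one reason real 4-bit files land nearer 35-40 GB than a flat 35 GB.

```python
# Bytes per parameter for the precisions named above (weights only).
BYTES_PER_PARAM = {"BF16": 2.0, "INT8": 1.0, "INT4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate weight storage in GB, taking 1 GB as 1e9 bytes."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


for precision in BYTES_PER_PARAM:
    print(f"70B @ {precision}: {weight_memory_gb(70e9, precision):.0f} GB")
```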

The core operation is mapping a continuous range of floating-point values onto a smaller set of discrete integers. A scale factor and zero-point value are stored per block of weights so that the original range can be approximately recovered at inference time.
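
A minimal sketch of that mapping for a single block, assuming asymmetric (affine) 4-bit quantization with one scale and one zero-point per block. The function names and sample values are illustrative and are not taken from any particular library.

```python
def quantize_block(block, bits=4):
    """Map a block of floats onto unsigned integers in [0, 2**bits - 1]."""
    qmax = (1 << bits) - 1
    lo, hi = min(block), max(block)
    scale = (hi - lo) / qmax
    if scale == 0.0:
        scale = 1.0  # constant block: any non-zero scale works
    zero_point = round(-lo / scale)  # integer code that represents 0.0
    codes = [min(qmax, max(0, round(x / scale) + zero_point)) for x in block]
    return codes, scale, zero_point


def dequantize_block(codes, scale, zero_point):
    """Approximately recover the original floats from the stored integers."""
    return [(c - zero_point) * scale for c in codes]


block = [-0.8, -0.1, 0.0, 0.3, 1.2]
codes, scale, zero_point = quantize_block(block)
restored = dequantize_block(codes, scale, zero_point)
print("codes:   ", codes)
print("restored:", [round(v, 3) for v in restored])
```

Only the integer codes plus one scale and one zero-point are stored for the block. The restored values differ from the originals by at most about half a quantization step.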

02. Why It Matters

Without quantization, running a capable open-weight model locally requires datacenter-grade hardware. A Llama-3.1-70B at BF16 needs two high-end GPUs. The same model at Q4_K_M drops to roughly 35-40 GB, which fits a single professional GPU, splits across two consumer cards, or runs with CPU+RAM offloading on a high-memory desktop. Smaller models in the 8B-30B class fit comfortably on a single consumer GPU. This is the primary reason local LLM inference became practical for individuals between 2023 and 2026.

On the server side, quantization lets cloud providers fit more model replicas onto the same GPU, cutting per-token cost. At 4-bit precision the weights occupy roughly a quarter of the memory they need at BF16, so a single H100 can hold up to about four times as many replicas, or leave far more room for the KV cache of concurrent requests.

03. How It Works

The standard pipeline is called post-training quantization (PTQ): take an already-trained model and compress it without retraining. The key challenge is that weight distributions are not uniform. A small fraction of weights carry outsized importance. Naive uniform rounding damages those weights disproportionately.
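
One way the non-uniform distribution hurts can be seen in a small sketch, assuming symmetric 4-bit round-to-nearest with a single scale for the whole list. The weight values are made up for illustration, and the example shows only the range effect of one large weight; it does not model how important each weight is to the output.

```python
def rtn_symmetric(weights, bits=4):
    """Naive round-to-nearest using one scale for every weight in the list."""
    qmax = (1 << (bits - 1)) - 1  # 7 for signed 4-bit
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) * scale for w in weights]


typical = [0.02, -0.03, 0.01, 0.04, -0.02]
with_outlier = typical + [1.5]  # one large weight stretches the scale

print("without outlier:", [round(v, 4) for v in rtn_symmetric(typical)])
print("with outlier:   ", [round(v, 4) for v in rtn_symmetric(with_outlier)])
```

With the large weight present, the step size grows so much that every small weight rounds to zero. Smaller blocks with their own scales, and the methods described next, are ways of limiting this kind of damage.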

Three methods dominate in 2026.

GPTQ (Frantar et al., 2022) quantizes one layer at a time. For each layer, it uses second-order (Hessian) information about how each weight affects the output error, then compensates for the rounding error of each weight by adjusting the remaining weights in the same row. This layer-wise error correction is why GPTQ achieves better quality than naive rounding at the same bit-width.
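
The error-compensation idea can be illustrated with a toy, two-weight sketch. This is a simplified rendering of the update GPTQ builds on, not the paper's implementation: real GPTQ works on whole weight matrices, derives the Hessian from calibration inputs, and uses a Cholesky-based formulation for speed and stability. The 2x2 Hessian, the weights, and the 0.5 grid step below are made-up values, and all function names are hypothetical.

```python
def invert(matrix):
    """Gauss-Jordan inverse for a small, well-conditioned square matrix."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def round_to_grid(x, step):
    return round(x / step) * step


def quantize_with_compensation(row, hessian, step):
    """Quantize left to right, pushing each rounding error onto later weights."""
    w = list(row)
    h_inv = invert(hessian)
    n = len(w)
    quantized = []
    for i in range(n):
        q = round_to_grid(w[i], step)
        quantized.append(q)
        d = h_inv[i][i]
        err = (w[i] - q) / d
        for j in range(i + 1, n):
            w[j] -= err * h_inv[i][j]  # compensate the not-yet-quantized weights
        # Remove index i from the inverse Hessian before the next step.
        for r in range(i + 1, n):
            for c in range(i + 1, n):
                h_inv[r][c] -= h_inv[r][i] * h_inv[i][c] / d
    return quantized


def output_error(row, quantized, hessian):
    """Quadratic proxy for the layer output error: e^T H e."""
    e = [a - b for a, b in zip(row, quantized)]
    n = len(e)
    return sum(e[i] * hessian[i][j] * e[j] for i in range(n) for j in range(n))


hessian = [[1.0, 0.9], [0.9, 1.0]]  # two strongly correlated input channels
row = [0.4, 0.3]
step = 0.5

naive = [round_to_grid(w, step) for w in row]
compensated = quantize_with_compensation(row, hessian, step)
print("naive:      ", naive, output_error(row, naive, hessian))
print("compensated:", compensated, output_error(row, compensated, hessian))
```

In this example the first weight is rounded up, so the compensated version nudges the second weight down before rounding it. The individual weights end up further from their originals, but the output error is lower, which is the quantity that matters.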

AWQ (Lin et al., 2023) takes a different approach. It first runs a calibration set through the model to observe which input channels produce large activations. Those channels are considered salient. Before quantizing, AWQ scales up the salient weight channels and scales down the corresponding activations by the same factor, so the mathematical output is unchanged but the weight values are now in a range that survives low-precision rounding much better. All weights are then quantized uniformly. AWQ generally matches or slightly beats GPTQ quality at INT4 and is a popular format for production inference with frameworks like vLLM.
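
The scaling trick can be checked on a single dot product. The sketch below uses made-up activations, weights, and a hand-picked scale of 4 on the salient channel, and it assumes a fixed rounding grid of 0.1; AWQ itself searches for per-channel scales using calibration data, which is not shown here.

```python
def dot(xs, ws):
    return sum(x * w for x, w in zip(xs, ws))


def round_to_grid(w, step=0.1):
    return round(w / step) * step


activations = [0.1, 8.0, -0.2]  # channel 1 is large, so it counts as salient
weights = [0.5, 0.03, -0.4]
scales = [1.0, 4.0, 1.0]        # hypothetical per-channel scale factors

scaled_weights = [w * s for w, s in zip(weights, scales)]
scaled_activations = [x / s for x, s in zip(activations, scales)]

exact = dot(activations, weights)
exact_scaled = dot(scaled_activations, scaled_weights)

plain_q = dot(activations, [round_to_grid(w) for w in weights])
awq_like_q = dot(scaled_activations, [round_to_grid(w) for w in scaled_weights])

print("exact output:        ", exact)
print("after scaling:       ", exact_scaled)
print("error, plain rounding:", abs(exact - plain_q))
print("error, scaled first:  ", abs(exact - awq_like_q))
```

Before rounding, the scaled and unscaled outputs agree. After rounding, the small weight on the salient channel is wiped out in the plain version but survives in the scaled version, so the output error is much smaller.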

GGUF is not a quantization algorithm but a file container format used by llama.cpp and Ollama. A GGUF file can hold weights at the format's discrete quantization levels, from Q2_K up to Q8_0. The format supports mixed precision within a model, keeping the precision-sensitive tensors such as embeddings, layer norms, and parts of the attention and feed-forward weights at higher precision while the rest is more aggressively quantized. GGUF also enables CPU+GPU hybrid inference, offloading some layers to GPU VRAM and running the rest on CPU RAM. This is slower than pure-GPU inference but makes large models accessible on machines with modest VRAM.
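
For the hybrid CPU+GPU case, a back-of-the-envelope estimate of how many layers fit in VRAM might look like the sketch below. It assumes all layers are the same size and reserves a fixed amount of VRAM for the KV cache and buffers; the function name, the 1.5 GB reserve, and the example sizes are illustrative assumptions, not llama.cpp behavior.

```python
def layers_on_gpu(model_gb, n_layers, vram_gb, reserve_gb=1.5):
    """Rough count of equally sized layers that fit in VRAM after a reserve."""
    per_layer_gb = model_gb / n_layers
    usable_gb = max(0.0, vram_gb - reserve_gb)
    return min(n_layers, int(usable_gb // per_layer_gb))


# A roughly 40 GB 4-bit 70B model with 80 layers on a 24 GB consumer GPU.
print(layers_on_gpu(model_gb=40.0, n_layers=80, vram_gb=24.0))

# A roughly 4.8 GB 8B model with 32 layers fits entirely.
print(layers_on_gpu(model_gb=4.8, n_layers=32, vram_gb=24.0))
```

In llama.cpp, a count like this is the kind of value passed to the `--n-gpu-layers` option; the remaining layers run from CPU RAM.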

FP8 is an emerging format supported by NVIDIA Hopper and Blackwell GPUs. It offers near-BF16 quality at half the memory, and is increasingly used in high-throughput production deployments where hardware supports it.

04. Key Terms and Variants

FP32:
4 bytes per parameter. Mainly appears in training, especially optimizer states, and is rarely used for inference. No quality advantage over BF16 for inference.

BF16:
2 bytes per parameter. The current standard baseline for inference. Preserves the dynamic range of FP32 with fewer mantissa bits.

FP16:
2 bytes per parameter. Functionally equivalent to BF16 for most inference purposes but has a smaller dynamic range, which can cause overflow on very large activations.
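
The range difference can be demonstrated with the standard library alone. In this sketch, FP16 is handled by Python's `struct` half-precision format, while BF16 is emulated by keeping the top 16 bits of the float32 encoding, which truncates rather than rounds; that emulation is a simplification for illustration.

```python
import struct


def to_fp16(x: float) -> float:
    """Round-trip through IEEE half precision; raises OverflowError if out of range."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16(x: float) -> float:
    """Emulate BF16 by truncating the float32 encoding to its top 16 bits."""
    top_two_bytes = struct.pack(">f", x)[:2]
    return struct.unpack(">f", top_two_bytes + b"\x00\x00")[0]


big_activation = 70000.0  # beyond the FP16 maximum of 65504

print("BF16:", to_bf16(big_activation))  # coarse, but still representable
try:
    to_fp16(big_activation)
except OverflowError as exc:
    print("FP16 overflow:", exc)
```

BF16 keeps the value with reduced precision, while FP16 cannot represent it at all.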

INT8:
1 byte per parameter. Approximately 97-98% quality retention vs. BF16. Widely used for production serving.

INT4:
0.5 bytes per parameter. Approximately 94-96% quality vs. BF16. Noticeable degradation on complex reasoning tasks. Still usable for most chat and instruction-following.

NF4:
A 4-bit data type that is information-theoretically optimal for normally distributed weights. Used mainly in QLoRA fine-tuning rather than general inference.

K-quants:
A GGUF-specific enhancement that organizes weights into super-blocks of 256 values with hierarchical scale metadata. Provides better quality per bit than legacy GGUF quantization types.
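
The cost of that hierarchical metadata can be worked out with simple arithmetic. The layout numbers below are an assumption based on how the 4-bit K-quant is commonly described for llama.cpp (8 sub-blocks of 32 weights, a 6-bit scale and 6-bit minimum per sub-block, and two 16-bit constants per super-block); the function name is illustrative.

```python
def bits_per_weight(
    weights_per_superblock=256,
    sub_blocks=8,
    weight_bits=4,
    sub_scale_bits=6,
    sub_min_bits=6,
    super_scale_bits=16,
    super_min_bits=16,
):
    """Average storage per weight once scale metadata is included."""
    total_bits = (
        weights_per_superblock * weight_bits
        + sub_blocks * (sub_scale_bits + sub_min_bits)
        + super_scale_bits
        + super_min_bits
    )
    return total_bits / weights_per_superblock


print(bits_per_weight())  # 4.5 bits per weight under the assumed layout
```

Under these assumptions the metadata adds half a bit per weight on top of the nominal 4 bits.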

GGUF naming convention:
The format is Q[bits]_[method]_[size]. For example, Q4_K_M means 4-bit precision, K-quant method, medium size variant. The _S, _M, _L suffixes within a family represent small/medium/large variants that trade off compression ratio against fidelity. Q4_K_M is the general-purpose sweet spot recommended by the llama.cpp maintainers for most use cases.
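
A small parser makes the convention concrete. This is a sketch that covers only the pattern described above plus the older `_0` and `_1` suffixes (treated here as the legacy types); other naming families are not handled, and the `QuantName` type is hypothetical.

```python
import re
from typing import NamedTuple, Optional


class QuantName(NamedTuple):
    bits: int
    family: str          # "k-quant" or "legacy"
    size: Optional[str]  # "S", "M", "L", or None when no size suffix is given


_PATTERN = re.compile(r"Q(\d)_(K|[01])(?:_([SML]))?")


def parse_quant_name(name: str) -> QuantName:
    match = _PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognized quantization name: {name!r}")
    bits, method, size = match.groups()
    family = "k-quant" if method == "K" else "legacy"
    return QuantName(int(bits), family, size)


for name in ("Q4_K_M", "Q2_K", "Q8_0"):
    print(name, "->", parse_quant_name(name))
```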

05. Examples

A Llama-3.1-8B model at BF16 requires about 16 GB. The Q4_K_M quantized version is approximately 4.8 GB and fits in the VRAM of a single mid-range laptop GPU. Perplexity (a standard quality metric, where lower is better) is roughly 0.05 higher for Q4_K_M than for Q5_K_M at the 7-8B scale, a difference most users cannot detect in casual use.
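
For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-likelihood the model assigns to each token of a test text. A minimal sketch, assuming the per-token log-probabilities (natural log) have already been collected; the sample values are made up.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# A model that gives every token probability 0.25 has a perplexity of about 4.
uniform_logprobs = [math.log(0.25)] * 4
print(perplexity(uniform_logprobs))
```

A quantized model is typically scored on the same text as its full-precision original, and the gap between the two perplexities is reported as the degradation.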

For a 70B model, the quality difference between bit levels is even smaller. Larger models have more redundancy in their weights, so aggressive quantization causes proportionally less damage. A Q4_K_M Llama-3.3-70B routinely outperforms an unquantized 7B model on benchmarks.

DeepSeek-R1 and Qwen3 distilled models are distributed primarily as GGUF files in the Q4-Q8 range for local deployment.

06. Common Pitfalls

Conflating format with method:
GGUF is a container format, not a quantization algorithm. A Q8_0 GGUF file is much higher quality than a Q2_K GGUF file. The quantization level is what matters for quality, not the format name.

Reasoning task degradation:
Quality benchmarks for general text generation understate the impact on multi-step reasoning. INT4 models show measurable regression on math, code, and chain-of-thought tasks even when they score near-identically on MMLU. A recent arXiv study (April 2025, arXiv 2504.04823) specifically documented that reasoning models are more sensitive to quantization than chat models.

Outlier features in larger models:
Beyond roughly 6.7 billion parameters, LLMs develop outlier features: a handful of hidden dimensions whose activations are far larger than the rest, as reported in the LLM.int8() work. Naive INT8 quantization of these models causes sudden quality collapse. GPTQ and AWQ both include mechanisms that limit the damage from outliers, which is why they outperform naive rounding at scale.

Hardware compatibility:
AWQ requires kernels optimized for specific GPU architectures. Using AWQ on unsupported hardware produces correct output but with no speedup. GGUF is the more hardware-agnostic option.

Single-request vs. batch throughput:
Some quantization methods (particularly SmoothQuant and LLM.int8()) provide throughput gains only when processing multiple requests simultaneously. They may not speed up a single inference call.
