01. What It Is
Every weight in a neural network is a number. Modern LLMs typically store those numbers in BF16 format, which uses 2 bytes per parameter. A 70-billion-parameter model therefore needs roughly 140 GB of memory just to load its weights. Quantization replaces each weight with a lower-precision approximation: INT8 uses 1 byte, INT4 uses half a byte. The same 70B model drops to about 35-40 GB at 4-bit precision (35 GB for the weights themselves, plus a small overhead for quantization metadata), so it fits on hardware that previously could not hold it.
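The arithmetic behind these figures can be checked with a short sketch. The function name `weight_memory_gb` is hypothetical, and the 4.5 bits-per-weight case assumes one particular layout (a 16-bit scale and a 16-bit zero-point shared by each block of 64 weights); real formats differ in block size and metadata width.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Memory needed for the weights alone, in decimal GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS_70B = 70e9

# Assumed layout for illustration only: 32 bits of metadata per block of
# 64 weights adds 32 / 64 = 0.5 bits to every 4-bit weight.
INT4_WITH_METADATA_BITS = 4 + 32 / 64

for label, bits in [
    ("BF16", 16),
    ("INT8", 8),
    ("INT4", 4),
    ("INT4 + per-block metadata (assumed)", INT4_WITH_METADATA_BITS),
]:
    print(f"{label}: {weight_memory_gb(PARAMS_70B, bits):.1f} GB")
```

Under these assumptions the script reports 140, 70, and 35 GB for the three formats, and a little over 39 GB once the assumed metadata is included. This covers weights only; activations and the KV cache need additional memory.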
The core operation is mapping a range of floating-point values onto a much smaller set of discrete integers: 256 levels for INT8, 16 for INT4. A scale factor and a zero-point are stored per block of weights so that the original values can be approximately recovered at inference time.
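The mapping can be sketched for a single block of weights as follows. This is a minimal illustration of plain asymmetric (affine) round-to-nearest quantization, not the scheme of any particular library; the names `quantize_block` and `dequantize_block` are hypothetical, and production formats typically pack the integers into bytes and use more elaborate rounding.

```python
def quantize_block(weights, bits=4):
    """Map one block of floats to integers in [0, 2**bits - 1].

    Returns the integer codes plus the scale and zero-point needed
    to approximately reconstruct the original values.
    """
    qmax = (1 << bits) - 1
    # Extend the range to include 0.0 so the zero-point is a valid code.
    lo = min(min(weights), 0.0)
    hi = max(max(weights), 0.0)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero_point = round(-lo / scale)
    codes = [
        max(0, min(qmax, round(w / scale) + zero_point))  # clamp to valid range
        for w in weights
    ]
    return codes, scale, zero_point


def dequantize_block(codes, scale, zero_point):
    """Recover approximate floats from integer codes."""
    return [(c - zero_point) * scale for c in codes]


block = [-0.42, -0.10, 0.0, 0.05, 0.31, 0.80]
codes, scale, zero_point = quantize_block(block, bits=4)
restored = dequantize_block(codes, scale, zero_point)

for original, code, approx in zip(block, codes, restored):
    print(f"{original:+.3f} -> code {code:2d} -> {approx:+.3f}")
```

Each restored value differs from the original by an error on the order of one quantization step (`scale`). Smaller blocks give each group of weights a tighter range and therefore a smaller step, at the cost of storing more scales and zero-points.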