Efficient Inference

Quantization

Shrink model memory 4-8× by storing weights in fewer bits — run giant models on a single GPU

The Deployment Problem

A 70-billion-parameter model stored in standard 16-bit precision needs about 140 GB of memory just to load the weights. A single consumer GPU has 24 GB. A data-center A100 has 80 GB. Even loading the model — before serving a single request — is impossible on most hardware. This is the deployment wall: models keep growing, but hardware budgets do not.

Quantization is the single most impactful technique for making large models runnable. The idea is simple: store each weight using fewer bits. Standard training uses 16 or 32 bits per number. If you use only 4 bits, you need 4× less memory. A 70B model in 4-bit fits in about 35 GB — achievable on a single 48 GB GPU. The trade-off is a small loss in accuracy, which modern methods keep remarkably low.

How LLMs Store Numbers

Every weight in a neural network is a number — a floating-point value. The number of bits used to store it determines both the memory cost and the precision (how many distinct values it can represent). The common formats, from most to least precise:

FP32 (32-bit float): the training default. ~4 billion distinct values. ~4 bytes per weight. FP16 / BF16 (16-bit): the inference default. ~65 thousand values. ~2 bytes. INT8 (8-bit integer): 256 distinct levels. ~1 byte. INT4 (4-bit integer): only 16 levels. ~0.5 bytes. The jump from FP16 to INT4 cuts memory by 4×. The question is: can 16 levels represent a weight accurately enough? Surprisingly, often yes — most weights cluster tightly around zero and tolerate coarse rounding.

The Quantization Math

Integer formats (INT8, INT4) cannot directly represent the fractional weights a model learns. Quantization maps the floating-point range to a small set of integer levels using a scale and a zero-point. The scale controls the step between adjacent levels; the zero-point shifts so that the value zero maps exactly to an integer (important for exact zeros in sparse weights).

# Quantize: map float weight w to integer q q = round(w / scale) + zero_point # store q (low-bit int) # Dequantize: recover approximate float at inference w ≈ (q - zero_point) × scale # Symmetric case (zero_point = 0): scale = max(|w|) / 127 # for INT8, levels are -127..127

The error introduced is the rounding error — the difference between the original weight and the nearest representable level. With 256 levels (INT8) this error is tiny. With 16 levels (INT4) it is larger, but clever methods (like grouping weights into small blocks and quantizing each block separately) keep the error low by adapting the scale to each block's value range.

Precision in 3D: Bit by Bit

The visualization below shows the four common formats as rows of bits. Each cube is one bit. More bits means more memory per weight, but also exponentially more representable values. Watch how INT4 (4 bits) uses 8× less memory than FP32 (32 bits) yet can still represent 16 distinct values — enough for most weights when done carefully. Hover or click a row to see its resolution.

Each cube = 1 bit. FP32 (32 bits, blue) → FP16 (16, green) → INT8 (8, amber) → INT4 (4, coral). Fewer bits = less memory, fewer representable values.

Memory Savings at Scale

The memory math is direct: memory = parameters × bytes-per-weight. Use the slider to pick a model size and watch how each precision shrinks the footprint. The practical milestone is the GPU memory wall: a 24 GB consumer card, a 48 GB card, an 80 GB data-center card. INT4 quantization is what lets a 70B model cross the 24 GB line that FP16 cannot.

Bar height = memory in GB for a given model size. Adjust the slider to change model size. INT4 (coral) is ~8× smaller than FP32 (blue).

Quantization Methods: PTQ vs QAT

There are two families of quantization, differing in when the quantization happens.

Post-Training Quantization (PTQ)

PTQ takes an already-trained model and quantizes its weights in a single pass — no further training. It is fast (minutes), needs no training data, and is what most users do. Popular PTQ formats: GGUF (used by llama.cpp, the standard for CPU/Apple Silicon inference), GPTQ (GPU-friendly, layer-by-layer), and AWQ (protects the most important "salient" weights, giving better accuracy at 4-bit).

Quantization-Aware Training (QAT)

QAT simulates quantization during training so the model learns to be robust to the rounding noise. It yields the best accuracy at aggressive bit-widths but requires full training infrastructure and data. Most open-source users never need QAT — modern PTQ methods like AWQ and GGUF already preserve >98% of quality at 4-bit.

A crucial practical detail: activation quantization is harder than weight quantization. Weights are static (known at load time), but activations vary with each input and can have extreme outliers. Methods like SmoothQuant handle this by mathematically "smoothing" outlier activations before quantizing, enabling fast INT8 inference without accuracy collapse.

The Accuracy Cost

Quantization is not free — each weight carries a small rounding error. The scene below visualizes this. The dense cloud is the true weight distribution; the vertical grid lines are the levels each format can represent. INT4 snaps each weight to one of only 16 levels, introducing visible error. But because most weights cluster near zero, the average error stays small, and benchmark accuracy typically drops less than 1-2%.

Weights (points) snapped to representable levels (grid planes). INT4 = 16 levels (larger gaps, more error). INT8 = 256 levels (tight gaps, near-lossless).

Running Models on Consumer Hardware

Quantization is why open-source LLMs are everywhere. The ecosystem that made this possible:

llama.cpp & GGUF

llama.cpp is a pure C/C++ inference engine that runs quantized models on CPU and Apple Silicon. Its GGUF format stores weights in 2-bit to 8-bit blocks and can offload layers to GPU when available. This is how people run Llama 3 70B on a MacBook with 64 GB of unified memory — impossible without 4-bit quantization. GGUF is the lingua franca of local LLM tools (Ollama, LM Studio, KoboldCpp all use it).

vLLM & GPU Serving

On the server side, vLLM loads AWQ- or GPTQ-quantized models onto GPUs for high-throughput serving. Combined with PagedAttention (see the KV Cache module) and continuous batching, a single quantized model can serve thousands of concurrent requests. Quantization roughly doubles throughput by halving memory bandwidth pressure — often the real bottleneck, not raw compute.

QLoRA: Train What You Cannot Load

Quantization also unlocks training. QLoRA loads a model in 4-bit (to fit in memory) while training tiny 16-bit LoRA adapters on top. This lets you fine-tune a 65B model on a single 48 GB GPU — see the LoRA module for the full story. Quantization and LoRA together are the foundation of accessible large-model customization.

Choosing a Precision

There is no single best precision — it depends on your hardware and accuracy needs. The table below summarizes the practical trade-offs for deploying a 7B model.

Format 7B Memory Accuracy Best For
FP32~28 GB100%Training only
FP16 / BF16~14 GB100%Server inference (baseline)
INT8~7 GB~99%High-accuracy serving
INT4 (AWQ/GPTQ)~3.5 GB~97-98%Consumer GPU / max throughput
INT4 (GGUF)~3.5 GB~96-97%CPU / Apple Silicon local

A good rule of thumb: use FP16 when memory allows, INT8 for a safe 2× compression, and INT4 (AWQ for GPU, GGUF for CPU) when you need to fit a model that otherwise would not run. The accuracy gap between INT4 and FP16 is now small enough that for most applications, users cannot tell the difference.

Key Takeaways

1

Quantization stores weights in fewer bits (INT8/INT4 instead of FP16/FP32), cutting memory 2-8× with a small accuracy cost.

2

Each weight is mapped to integer levels via a scale and zero-point: q = round(w/scale) + zero_point. The rounding error is the price of compression.

3

PTQ (GGUF/GPTQ/AWQ) quantizes a trained model in minutes without data; QAT simulates quantization during training for best accuracy but needs full training.

4

INT4 quantization is what makes 70B models runnable on a single GPU and is the foundation of the local-LLM ecosystem (llama.cpp, Ollama, LM Studio).

5

Accuracy loss at 4-bit is now typically under 2-3% thanks to methods like AWQ (protecting salient weights) and per-block scaling — small enough that most users do not notice.

Explore Related Topics

Continue your journey through LLM fundamentals: