LLM Quantization Explained — Run Big Models on Small Hardware
Quantization is the technique that lets you run a 70-billion-parameter model on a single 48GB GPU instead of a multi-GPU server. Reducing weight precision from 16-bit to 4-bit cuts weight memory by 75%, usually with only a small loss in quality.
What is Quantization?
Quantization is the process of reducing the numerical precision of a model's weights. In standard training, neural network weights are stored as 32-bit floating point numbers (FP32) or 16-bit floating point numbers (FP16/BF16). Quantization converts these to lower-precision formats like 8-bit integers (INT8) or 4-bit integers (INT4).
Think of it like image compression: a RAW photo might be 50MB, but a high-quality JPEG at 5MB looks nearly identical to the human eye. Similarly, a quantized model uses less memory and runs faster, with minimal impact on output quality.
The math is straightforward: a single weight stored as FP16 takes 2 bytes. Stored as INT8, it takes 1 byte (50% reduction). Stored as INT4, it takes 0.5 bytes (75% reduction). For a 70B parameter model, this means:
- FP16: ~140 GB of memory
- INT8: ~70 GB of memory
- INT4: ~35 GB of memory
That INT4 version can fit on a single GPU with 48GB of VRAM (like an RTX A6000), while the FP16 version would need multiple high-end GPUs. An RTX 4090, with 24GB of VRAM, is too small to hold the 70B INT4 weights on its own.
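The arithmetic above can be reproduced in a few lines. This is a minimal sketch that counts weights only (no KV cache, activations, or framework overhead); the function name `weight_memory_gb` is made up for illustration.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

params_70b = 70e9
footprint = {
    name: weight_memory_gb(params_70b, bits)
    for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]
}
for name, gb in footprint.items():
    print(f"{name}: ~{gb:.0f} GB")
```

Swapping in a different parameter count (for example `7e9`) gives the corresponding figures for smaller models.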
Why Quantize LLMs?
The motivation for quantization comes from a simple problem: LLMs are too big for consumer hardware.
Memory Constraints
Modern LLMs have billions of parameters. Even a "small" 7B model requires ~14GB just to load its weights in FP16. Add the KV cache, activations, and framework overhead, and you need ~20GB+ of VRAM. Most consumer GPUs have 8-24GB of VRAM.
Cost Reduction
Running LLMs in the cloud is expensive. A100 GPUs cost $2-4/hour. Quantized models can run on cheaper hardware, even on CPUs with enough RAM. For businesses, this can reduce inference costs by roughly 50-75% when the number of GPUs needed to hold the model is the main cost driver.
Latency
Smaller data types mean less memory bandwidth is needed to load weights. Since LLM inference is often memory-bandwidth bound (not compute bound), quantization can actually speed up generation even on hardware that supports FP16 natively.
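A back-of-the-envelope estimate shows why. If generating each token requires reading every weight from memory once (batch size 1), the memory bandwidth divided by the size of the weights gives an upper bound on tokens per second. The sketch below uses a hypothetical GPU with 1000 GB/s of bandwidth and ignores the KV cache, dequantization cost, and compute time, so real numbers will be lower.

```python
def max_tokens_per_second(num_params: float, bits_per_weight: float,
                          bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed when weight loading is the only cost."""
    weight_bytes = num_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / weight_bytes

bandwidth = 1000.0  # GB/s, an assumed round number rather than a specific GPU
fp16_bound = max_tokens_per_second(8e9, 16, bandwidth)
int4_bound = max_tokens_per_second(8e9, 4, bandwidth)
print(f"8B model, FP16: at most ~{fp16_bound:.1f} tokens/s")
print(f"8B model, INT4: at most ~{int4_bound:.1f} tokens/s")
```

Under these assumptions the 4-bit model has a ceiling four times higher than the FP16 model, simply because four times fewer bytes move per token.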
Edge Deployment
Quantization enables running LLMs on laptops, phones, and embedded devices. Models like Llama 3 8B can run on a MacBook Air with 16GB RAM when quantized to 4-bit.
How Quantization Works
At its core, quantization maps floating-point values to a smaller set of discrete values. Here's the basic idea:
Linear Quantization
The simplest approach is linear (or uniform) quantization. Given a range of floating-point values, we divide that range into equal intervals and map each value to the nearest interval:
- Find the min and max values in the weight tensor
- Divide the range into 2^n evenly spaced levels (where n is the target bit width)
- Map each weight to the nearest level and store that level's integer index
For example, with INT4 quantization (16 levels), a weight of 0.73 might be mapped to the nearest of 16 discrete values, say 0.75.
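The steps above can be sketched in plain Python. This is one common variant, assumed here for illustration: the lowest and highest of the 2^n levels sit exactly on the tensor's min and max, and the helper names `quantize_linear` and `dequantize_linear` are made up.

```python
def quantize_linear(weights, bits):
    """Min/max uniform quantization to 2**bits levels."""
    levels = 2 ** bits
    w_min, w_max = min(weights), max(weights)
    # Distance between adjacent levels; fall back to 1.0 for a constant tensor.
    scale = (w_max - w_min) / (levels - 1) or 1.0
    codes = [round((w - w_min) / scale) for w in weights]
    return codes, scale, w_min

def dequantize_linear(codes, scale, w_min):
    """Map integer codes back to approximate floating-point values."""
    return [w_min + c * scale for c in codes]

weights = [-1.0, -0.42, 0.05, 0.73, 1.1]
codes, scale, w_min = quantize_linear(weights, bits=4)
restored = dequantize_linear(codes, scale, w_min)
for w, c, r in zip(weights, codes, restored):
    print(f"{w:+.2f} -> code {c:2d} -> {r:+.3f}")
```

Each restored value differs from the original by at most half a step (`scale / 2`), which is the rounding error that quantization trades for the smaller storage format.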
Symmetric vs Asymmetric
Symmetric quantization assumes the values are centered around zero: it uses a single scale factor and maps 0.0 exactly to the integer 0. Asymmetric quantization stores a zero-point offset alongside the scale, so the integer range can cover an arbitrary [min, max] interval. That works better when the data isn't centered around zero.
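A small comparison makes the difference concrete. The sketch below round-trips a deliberately skewed set of values (all between 0.5 and 1.0) through both schemes at 4 bits. The function names and the test data are illustrative assumptions, not taken from any particular library.

```python
def quantize_symmetric(values, bits):
    """Round-trip with a single scale and no offset (zero maps to zero)."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax or 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return [c * scale for c in codes]

def quantize_asymmetric(values, bits):
    """Round-trip with a scale plus an integer zero-point offset."""
    qmax = 2 ** bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax or 1.0
    zero_point = round(-lo / scale)
    codes = [max(0, min(qmax, round(v / scale) + zero_point)) for v in values]
    return [(c - zero_point) * scale for c in codes]

def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)

skewed = [0.5 + 0.01 * i for i in range(51)]  # values from 0.50 to 1.00
error_sym = mean_abs_error(skewed, quantize_symmetric(skewed, bits=4))
error_asym = mean_abs_error(skewed, quantize_asymmetric(skewed, bits=4))
print(f"symmetric error:  {error_sym:.4f}")
print(f"asymmetric error: {error_asym:.4f}")
```

On this data the symmetric scheme spends most of its levels on negative values and values below 0.5 that never occur, so the asymmetric version ends up with the smaller error. On zero-centered data the two behave similarly.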
Per-Tensor vs Per-Channel
Quantization can be applied at different granularities:
- Per-tensor: One scale factor for the entire weight matrix (fastest, least accurate)
- Per-channel: Separate scale factors for each output channel (better accuracy)
- Per-group: Separate scale factors for groups of weights (best accuracy, used by GPTQ/AWQ)
Per-group quantization with a group size of 128 is a common default: it provides good accuracy, and storing one 16-bit scale per 128 weights adds only about 0.125 bits per weight.
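To see why finer granularity helps, consider a tensor with a single outlier. With one scale for everything, the outlier stretches the step size for every weight; with groups, only the group containing the outlier is affected. The sketch below uses synthetic weights and made-up helper names, and it treats a flat list as the tensor for simplicity.

```python
def fake_quantize(values, bits=4):
    """Symmetric quantize-then-dequantize with one shared scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax or 1.0
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

def fake_quantize_grouped(values, group_size, bits=4):
    """Same round-trip, but each group of weights gets its own scale."""
    out = []
    for start in range(0, len(values), group_size):
        out.extend(fake_quantize(values[start:start + group_size], bits))
    return out

def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)

# 256 small synthetic weights in [-0.05, 0.05], plus one large outlier.
weights = [0.01 * ((i * 7) % 11 - 5) for i in range(256)]
weights[3] = 2.0

err_tensor = mean_abs_error(weights, fake_quantize(weights))
err_group = mean_abs_error(weights, fake_quantize_grouped(weights, group_size=128))
print(f"per-tensor error: {err_tensor:.4f}")
print(f"per-group error:  {err_group:.4f}")
```

Here the second group of 128 weights contains no outlier, so it gets a much smaller step size and its weights are reconstructed far more accurately than under the single per-tensor scale.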
Quantization Methods
There are two main categories of quantization: quantization-aware training (QAT) and post-training quantization (PTQ). For LLMs, PTQ is dominant because retraining is expensive.
INT8 Quantization
INT8 is the most straightforward approach. It reduces weights from 16-bit to 8-bit with minimal quality loss. The bitsandbytes library makes this trivially easy in PyTorch:
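from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit weight loading via bitsandbytes (needs a CUDA GPU and the bitsandbytes package)
quant_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=quant_config,
    device_map="auto",  # place layers on the available GPU(s) automatically
)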
INT8 is a great starting point — it halves memory usage with negligible quality impact. It's well-supported across frameworks.
GPTQ
GPTQ (GPT Quantization) is an advanced INT4 quantization method developed by the Efficient ML group at IST Austria. It uses approximate second-order information (Hessian) to determine the optimal way to quantize each weight while minimizing output error.
Key features of GPTQ:
- Uses calibration data (a small dataset) to find optimal quantization
- Processes weights column by column, compensating for quantization error
- Achieves INT4 quality close to INT8
- Fast inference with specialized CUDA kernels (via AutoGPTQ or ExLlamaV2)
GPTQ models are widely available on Hugging Face. Many community members quantize popular models and share them.
AWQ
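AWQ (Activation-aware Weight Quantization) takes a different approach. Instead of looking at the weights alone, it uses activation statistics from calibration data to identify the small fraction of weight channels that matter most for the output. These salient channels are protected by scaling them up before quantization, with the inverse scale applied to the incoming activations. This reduces their relative quantization error while keeping every weight in the same low-bit format.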
AWQ advantages:
- Doesn't need backpropagation or gradient computation
- Works well with limited calibration data
- Fast quantization process
- Excellent quality at INT4, especially for smaller models
The autoawq library provides easy-to-use AWQ quantization.
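The core idea of activation-aware scaling can be illustrated with a toy example. This is a heavily simplified sketch, not the autoawq implementation. It assumes one weight group per output row, a single calibration activation vector with one hypothetical salient channel, and a small grid search over an exponent `alpha` that turns activation magnitude into a per-channel scale. All names are made up for illustration.

```python
import random

def fake_quantize_row(row, bits=4):
    """Symmetric quantize-then-dequantize of one weight group."""
    qmax = 2 ** (bits - 1) - 1
    step = max(abs(w) for w in row) / qmax or 1.0
    return [max(-qmax, min(qmax, round(w / step))) * step for w in row]

def output_error(weights, act, channel_scales):
    """Mean |W x - Q(W * s) (x / s)| over output rows."""
    scaled_act = [a / s for a, s in zip(act, channel_scales)]
    total = 0.0
    for row in weights:
        exact = sum(w * a for w, a in zip(row, act))
        q_row = fake_quantize_row([w * s for w, s in zip(row, channel_scales)])
        approx = sum(w * a for w, a in zip(q_row, scaled_act))
        total += abs(exact - approx)
    return total / len(weights)

rng = random.Random(0)
in_features, out_features = 128, 64
weights = [[rng.gauss(0.0, 1.0) for _ in range(in_features)]
           for _ in range(out_features)]

# Assumed calibration statistic: channel 5 sees much larger activations.
act = [1.0] * in_features
act[5] = 50.0

errors = {}
for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    # alpha = 0 gives a scale of 1 everywhere, i.e. plain quantization.
    scales = [abs(a) ** alpha for a in act]
    errors[alpha] = output_error(weights, act, scales)

best_alpha = min(errors, key=errors.get)
for alpha, err in errors.items():
    print(f"alpha={alpha:.2f}: mean output error {err:.3f}")
print("best alpha:", best_alpha)
```

Scaling the salient channel too aggressively inflates the group's step size and hurts every other weight, which is why the search tries several exponents instead of always using the largest scale.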
GGUF (llama.cpp)
GGUF (GPT-Generated Unified Format) is the format used by llama.cpp for CPU and mixed CPU/GPU inference. It supports various quantization levels:
| Format | Bits/Weight | Quality | Speed |
|---|---|---|---|
| Q8_0 | 8.5 | Near-perfect | Fast |
| Q6_K | 6.6 | Excellent | Fast |
| Q5_K_M | 5.7 | Very good | Good |
| Q4_K_M | 4.8 | Good | Good |
| Q3_K_M | 3.9 | Acceptable | Moderate |
| Q2_K | 2.9 | Poor | Slow |
Q4_K_M is the most popular choice — it offers the best balance of quality, size, and speed for most use cases.
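The bits-per-weight column translates directly into approximate file sizes. The sketch below applies the table's figures to an 8B-parameter model. It is an estimate only, since real GGUF files also contain metadata and may keep some tensors at a different precision.

```python
# Approximate bits per weight, copied from the table above.
GGUF_BITS_PER_WEIGHT = {
    "Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7,
    "Q4_K_M": 4.8, "Q3_K_M": 3.9, "Q2_K": 2.9,
}

def gguf_size_gb(num_params: float, fmt: str) -> float:
    """Estimated size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return num_params * GGUF_BITS_PER_WEIGHT[fmt] / 8 / 1e9

sizes_8b = {fmt: gguf_size_gb(8e9, fmt) for fmt in GGUF_BITS_PER_WEIGHT}
for fmt, size in sizes_8b.items():
    print(f"{fmt:7s} ~{size:.1f} GB")
```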
Model Formats: GGUF, safetensors
Different frameworks use different file formats for quantized models:
GGUF
Used by llama.cpp, Ollama, LM Studio, and other CPU-friendly tools. GGUF files are self-contained — they include the model weights, tokenizer, and metadata in a single file. Great for local inference.
safetensors + config
Used by Hugging Face transformers, AutoGPTQ, and AutoAWQ. The model is stored as one or more safetensors weight shards plus separate config and tokenizer files. This format is used for GPU inference with PyTorch-based frameworks.
Which to Choose?
- Running on CPU or mixed CPU/GPU: Use GGUF with llama.cpp or Ollama
- Running on GPU with Python: Use GPTQ/AWQ safetensors with transformers or vLLM
- Maximum portability: GGUF (single file, works everywhere)
Practical Guide
Option 1: Use Pre-Quantized Models
The easiest approach is to download models that are already quantized. On Hugging Face, search for "GGUF" or "GPTQ" versions of popular models. For Ollama:
# Download and run a 4-bit quantized model
ollama run llama3:8b-instruct-q4_K_M
# Or pick a different quantization level
ollama run llama3:8b-instruct-q5_K_M
Option 2: Quantize Yourself with GPTQ
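from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Calibration data: a list of tokenized examples.
# One sentence is only a placeholder; use a few hundred representative texts in practice.
calibration_texts = ["Quantization reduces the numerical precision of model weights."]
calibration_dataset = [tokenizer(text) for text in calibration_texts]

quantize_config = BaseQuantizeConfig(
    bits=4,             # target bit width
    group_size=128,     # one scale factor per 128 weights
    damp_percent=0.01,  # Hessian dampening for numerical stability
)
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)
model.quantize(calibration_dataset)
model.save_quantized("./llama3-8b-gptq")
tokenizer.save_pretrained("./llama3-8b-gptq")  # keep the tokenizer next to the weights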
Option 3: Create GGUF with llama.cpp
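# Step 1: convert the Hugging Face model to a 16-bit GGUF file
# (the converter writes f32/f16/bf16/q8_0 output, not K-quants such as q4_k_m)
python convert_hf_to_gguf.py /path/to/model --outtype f16 --outfile model-f16.gguf
# Step 2: quantize the 16-bit GGUF to Q4_K_M
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M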
Quality vs Speed Trade-offs
The key question when choosing a quantization level is: how much quality am I willing to lose?
General Guidelines
- INT8 / Q8_0: Negligible quality loss. Use if you have the memory.
- INT4 (Q4_K_M): Small quality loss, usually acceptable for most tasks. Best balance for constrained hardware.
- INT3 / Q3: Noticeable quality loss. Only use if you must fit a larger model.
- INT2 / Q2: Significant quality degradation. Not recommended for production.
When to Use Higher Precision
Some tasks are more sensitive to quantization than others:
- Math/Code: More sensitive — quantization can break precise computations
- Creative writing: Less sensitive — quantized models often perform well
- Reasoning: Moderately sensitive — complex chains of thought degrade more
- Translation: Less sensitive — good even at INT4
Rule of thumb: if your model is 7B parameters, try Q5_K_M or Q4_K_M. For 13B+, Q4_K_M is usually the sweet spot. For 70B, Q3_K_M or Q4_K_M can be worth it if it's the only way to fit the model.
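This rule of thumb can be turned into a small helper that picks the highest-precision GGUF format whose weights fit in a given memory budget. It is a rough sketch: the 2 GB default for `overhead_gb` is an assumed placeholder for the KV cache and runtime overhead, not a measured value, and the function name is made up.

```python
# Approximate bits per weight from the GGUF table earlier in this article,
# ordered from highest to lowest precision.
FORMATS = [("Q8_0", 8.5), ("Q6_K", 6.6), ("Q5_K_M", 5.7),
           ("Q4_K_M", 4.8), ("Q3_K_M", 3.9), ("Q2_K", 2.9)]

def best_format_that_fits(num_params, memory_gb, overhead_gb=2.0):
    """Return the first (highest-precision) format that fits, or None."""
    for name, bits in FORMATS:
        weights_gb = num_params * bits / 8 / 1e9
        if weights_gb + overhead_gb <= memory_gb:
            return name
    return None

print(best_format_that_fits(8e9, memory_gb=8))    # 8B model on an 8 GB GPU
print(best_format_that_fits(70e9, memory_gb=48))  # 70B model on a 48 GB GPU
print(best_format_that_fits(70e9, memory_gb=24))  # 70B model on a 24 GB GPU
```

A result of `None` means no listed format fits, in which case a smaller model or CPU offloading is the realistic option.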
Frequently Asked Questions
What is LLM quantization?
LLM quantization is the process of reducing the precision of a model's weights from 16-bit or 32-bit floating point numbers to lower precision formats like 8-bit or 4-bit integers. This reduces memory usage and can speed up inference, allowing large models to run on smaller hardware.
Does quantization reduce model quality?
Yes, quantization typically causes some quality degradation, but the impact varies by method and level. INT8 quantization usually has negligible quality loss. INT4 quantization may cause noticeable degradation on complex reasoning tasks. Advanced methods like GPTQ and AWQ minimize quality loss through careful calibration.
What is the difference between GPTQ and AWQ?
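Both are post-training quantization methods, most commonly used at 4 bits. GPTQ uses approximate second-order information to compensate for the error each quantized weight introduces. AWQ (Activation-aware Weight Quantization) identifies the most important weight channels from activation patterns and protects them by rescaling before quantization. AWQ is usually faster to apply because it needs no Hessian computation. Which one gives better quality depends on the model and the inference kernel, so it is worth evaluating both on your own task.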
How much VRAM do I need for a quantized model?
Rough rule: a 7B model needs ~14GB at FP16, ~7GB at INT8, and ~4GB at INT4. A 13B model needs ~26GB at FP16, ~13GB at INT8, and ~7GB at INT4. A 70B model needs ~140GB at FP16, ~70GB at INT8, and ~35GB at INT4. These figures cover the weights only; the KV cache, activations, and framework overhead add more, and actual usage varies by implementation.