LLM Quantization Explained — Run Big Models on Small Hardware

Last updated: June 23, 2026 · 12 min read

Quantization is the technique that lets you run a 70-billion-parameter model on a single GPU instead of a multi-GPU server. By reducing weight precision from 16-bit to 4-bit, you cut weight memory by 75%, usually with only a modest loss in quality.

What is Quantization?

Quantization is the process of reducing the numerical precision of a model's weights. In standard training, neural network weights are stored as 32-bit floating point numbers (FP32) or 16-bit floating point numbers (FP16/BF16). Quantization converts these to lower-precision formats like 8-bit integers (INT8) or 4-bit integers (INT4).

Think of it like image compression: a RAW photo might be 50MB, but a high-quality JPEG at 5MB looks nearly identical to the human eye. Similarly, a quantized model uses less memory and runs faster, with minimal impact on output quality.

The math is straightforward: a single weight stored as FP16 takes 2 bytes. Stored as INT8, it takes 1 byte (50% reduction). Stored as INT4, it takes 0.5 bytes (75% reduction). For a 70B parameter model, this means:

FP16: ~140GB
INT8: ~70GB
INT4: ~35GB

That INT4 version can fit on a single GPU with 48GB of VRAM (like an RTX A6000; a 24GB RTX 4090 is too small on its own), while the FP16 version would need multiple high-end GPUs.
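The arithmetic above is easy to script. This is a minimal sketch that counts the weights only (no KV cache, activations, or per-group scale overhead); the function name `weight_memory_gb` is just an illustrative choice.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

# Same three precisions as in the text, for a 70B-parameter model.
for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"70B at {name}: ~{weight_memory_gb(70e9, bits):.0f} GB")
```

Swapping in 7e9 or 13e9 gives the figures for smaller models.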

Why Quantize LLMs?

The motivation for quantization comes from a simple problem: LLMs are too big for consumer hardware.

Memory Constraints

Modern LLMs have billions of parameters. Even a "small" 7B model requires ~14GB just to load its weights in FP16. Add the KV cache, activations, and framework overhead, and you need ~20GB+ of VRAM. Most consumer GPUs have 8-24GB of VRAM.
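To get a feel for how much the KV cache adds, here is a rough estimator. It assumes a standard transformer that stores one key and one value vector per layer, per attention head, per token, in FP16. The shape used below (32 layers, 32 KV heads, head dimension 128) is an assumption modeled on a Llama-2-7B-style architecture; check your model's config for the real values.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Size of the KV cache: one K and one V vector per layer, head, and token."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

# Assumed shape for a 7B Llama-2-style model, 4096-token context, FP16 cache.
cache = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
print(f"KV cache: {cache / 2**30:.1f} GiB")
```

Models that use grouped-query attention have fewer KV heads, so their cache is proportionally smaller.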

Cost Reduction

Running LLMs in the cloud is expensive. A100 GPUs cost $2-4/hour. Quantized models can run on cheaper hardware — even on CPUs with enough RAM. For businesses, this can reduce inference costs by 50-75%.

Latency

Smaller data types mean less memory bandwidth is needed to load weights. Since LLM inference is often memory-bandwidth bound (not compute bound), quantization can actually speed up generation even on hardware that supports FP16 natively.
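A back-of-the-envelope way to see this: if generating each token requires reading every weight once, then memory bandwidth divided by weight size gives an upper bound on tokens per second. The sketch below uses a hypothetical 1000 GB/s GPU and ignores KV cache reads, compute time, and dequantization overhead, so real numbers will be lower.

```python
def max_tokens_per_second(num_params, bits_per_weight, bandwidth_gb_s):
    """Upper bound on single-stream decode speed when each token reads all weights once."""
    weight_gb = num_params * bits_per_weight / 8 / 1e9
    return bandwidth_gb_s / weight_gb

BANDWIDTH_GB_S = 1000  # hypothetical memory bandwidth, not a specific GPU

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    bound = max_tokens_per_second(7e9, bits, BANDWIDTH_GB_S)
    print(f"7B at {name}: at most ~{bound:.0f} tokens/s")
```

Under this simple model, halving the bits per weight doubles the ceiling on generation speed.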

Edge Deployment

Quantization enables running LLMs on laptops, phones, and embedded devices. Models like Llama 3 8B can run on a MacBook Air with 16GB RAM when quantized to 4-bit.

How Quantization Works

At its core, quantization maps floating-point values to a smaller set of discrete values. Here's the basic idea:

Linear Quantization

The simplest approach is linear (or uniform) quantization. Given a range of floating-point values, we divide that range into equal intervals and map each value to the nearest interval:

For example, with INT4 quantization (16 levels), a weight of 0.73 might be mapped to the nearest of 16 discrete values, say 0.75.
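Here is a minimal sketch of that mapping in plain Python. It assumes the value range is known in advance (here [-1, 1], chosen only for illustration) and places the 16 levels at equal spacing across it; the function names are hypothetical.

```python
def quantize_uniform(x, lo, hi, num_bits=4):
    """Map x in [lo, hi] to an integer code in [0, 2**num_bits - 1]."""
    levels = 2 ** num_bits
    step = (hi - lo) / (levels - 1)        # spacing between adjacent levels
    clipped = min(max(x, lo), hi)          # values outside the range are clamped
    code = round((clipped - lo) / step)    # index of the nearest level
    return code, step

def dequantize_uniform(code, lo, step):
    """Recover the floating-point value that a code stands for."""
    return lo + code * step

code, step = quantize_uniform(0.73, lo=-1.0, hi=1.0)
approx = dequantize_uniform(code, -1.0, step)
print(f"code={code}, restored={approx:.4f}, error={abs(approx - 0.73):.4f}")
```

The restored value is never more than half a step away from the original, which is the basic accuracy guarantee of uniform quantization.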

Symmetric vs Asymmetric

Symmetric quantization assumes the data is centered around zero: it uses a single scale factor for positive and negative values and maps zero exactly to integer 0. Asymmetric quantization adds a zero-point offset so the integer range can cover an arbitrary [min, max] interval, which works better when the data isn't centered around zero.
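The difference is easiest to see on data that is not centered on zero. The sketch below quantizes the same made-up, mostly positive values both ways at 4 bits and compares the average error. The function names and sample values are illustrative, not taken from any library.

```python
def quantize_symmetric(values, num_bits=4):
    """One scale, zero maps to code 0, grid spans [-max|x|, +max|x|]."""
    qmax = 2 ** (num_bits - 1) - 1                       # 7 for 4-bit
    scale = max(abs(v) for v in values) / qmax
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return [c * scale for c in codes]

def quantize_asymmetric(values, num_bits=4):
    """Scale plus integer zero-point, grid spans roughly [min, max]."""
    qmax = 2 ** num_bits - 1                             # 15 for 4-bit
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax
    zero_point = round(-lo / scale)                      # code that represents 0.0
    codes = [max(0, min(qmax, round(v / scale) + zero_point)) for v in values]
    return [(c - zero_point) * scale for c in codes]

def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)

# Hypothetical skewed values: mostly positive, with a small negative tail.
skewed = [-0.20, 0.10, 0.35, 0.80, 1.20, 1.90, 2.40, 3.00]
sym_err = mean_abs_error(skewed, quantize_symmetric(skewed))
asym_err = mean_abs_error(skewed, quantize_asymmetric(skewed))
print(f"symmetric: {sym_err:.4f}  asymmetric: {asym_err:.4f}")
```

On this data the symmetric grid wastes about half its levels on negative values that never occur, so the asymmetric version ends up with a noticeably lower error.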

Per-Tensor vs Per-Channel

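Quantization can be applied at different granularities:

Per-tensor: one scale for the entire weight matrix. Lowest overhead, lowest accuracy.
Per-channel: one scale per output channel of the matrix.
Per-group: one scale per small block of consecutive weights, for example 128.

Per-group quantization with group size 128 is a common sweet spot: it provides good accuracy with manageable overhead.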
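Finer granularity helps because a single outlier only distorts the scale of its own group. The sketch below is a toy illustration: it uses a group size of 4 instead of 128 so the example stays readable, and the weights are made up.

```python
def quantize_block(values, num_bits=4):
    """Symmetric round-to-nearest with a single scale for the whole block."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    if scale == 0:
        return list(values)
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

def quantize_per_group(values, group_size, num_bits=4):
    """Quantize each run of group_size weights with its own scale."""
    restored = []
    for start in range(0, len(values), group_size):
        restored.extend(quantize_block(values[start:start + group_size], num_bits))
    return restored

def total_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored))

# Hypothetical weights: small values plus one large outlier at the end.
weights = [0.01, -0.02, 0.03, -0.01, 0.02, 0.01, -0.03, 8.0]
per_tensor_err = total_abs_error(weights, quantize_block(weights))
per_group_err = total_abs_error(weights, quantize_per_group(weights, group_size=4))
print(f"per-tensor: {per_tensor_err:.4f}  per-group: {per_group_err:.4f}")
```

With one scale for everything, the outlier forces all the small weights to round to zero. With groups, only the weights that share a group with the outlier are affected. The cost is one extra scale to store per group.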

Quantization Methods

There are two main categories of quantization: quantization-aware training (QAT) and post-training quantization (PTQ). For LLMs, PTQ is dominant because retraining is expensive.

INT8 Quantization

INT8 is the most straightforward approach. It reduces weights from 16-bit to 8-bit with minimal quality loss. The bitsandbytes library makes this trivially easy in PyTorch:

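from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit weights via bitsandbytes (needs the bitsandbytes and accelerate packages and a CUDA GPU)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)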

INT8 is a great starting point — it halves memory usage with negligible quality impact. It's well-supported across frameworks.

GPTQ

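GPTQ (post-training quantization for generative pre-trained transformers) is an advanced weight quantization method, most often used at 4-bit, developed by researchers at IST Austria and ETH Zurich. It quantizes the weights one layer at a time and uses approximate second-order information (a Hessian estimated from a small calibration set) to adjust the not-yet-quantized weights so that the layer's output error stays small.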
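The core idea, compensating for one weight's rounding error by nudging the weights that have not been quantized yet, can be shown with just two weights. This is a toy sketch of that compensation step, not the full GPTQ algorithm (which processes whole matrices in blocks and works with a factorized inverse Hessian). The calibration inputs, weights, and grid step below are all made up.

```python
def round_to_grid(w, step):
    return round(w / step) * step

def output_error(original, quantized, H):
    """Squared output error over the calibration set: delta^T H delta."""
    d = [q - w for w, q in zip(original, quantized)]
    return sum(d[i] * H[i][j] * d[j] for i in range(2) for j in range(2))

# Hypothetical calibration inputs; each tuple is one sample of the two input features.
X = [(1.0, 0.9), (0.5, 0.6), (-1.0, -1.1), (0.2, 0.1)]
# H = sum of x x^T over the samples (the Hessian of the squared output error, up to a constant).
H = [[sum(x[i] * x[j] for x in X) for j in range(2)] for i in range(2)]

w = [0.37, -0.66]
step = 0.25

# Baseline: round every weight independently.
rtn = [round_to_grid(w[0], step), round_to_grid(w[1], step)]

# Compensated: quantize w[0], shift w[1] to absorb the error, then quantize w[1].
q0 = round_to_grid(w[0], step)
w1_shifted = w[1] + (w[0] - q0) * H[0][1] / H[1][1]
compensated = [q0, round_to_grid(w1_shifted, step)]

rtn_err = output_error(w, rtn, H)
comp_err = output_error(w, compensated, H)
print(rtn, f"{rtn_err:.4f}")
print(compensated, f"{comp_err:.4f}")
```

Because the two input features are strongly correlated in this data, an error in the first weight can be largely cancelled by moving the second one in the opposite direction. Independent rounding misses that.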

Key features of GPTQ:

GPTQ models are widely available on Hugging Face. Many community members quantize popular models and share them.

AWQ

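AWQ (Activation-aware Weight Quantization) takes a different approach. Instead of looking at the weights alone, it uses activation statistics from a calibration set to find the small fraction of weight channels that matter most. These salient channels are protected by scaling them up before quantization (with the inverse scale applied on the activation side), which reduces their relative rounding error. All weights still end up in the same low-bit format, so no mixed-precision storage is needed.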
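One way to protect a salient channel without leaving the 4-bit format is to scale its weight up before quantization and scale the matching activation down by the same factor. The toy sketch below shows the effect on a single output value. The weights, activations, and the scale factor of 2 are hand-picked for illustration; AWQ itself searches for per-channel scales using calibration data.

```python
def quantize_group(values, num_bits=4):
    """Symmetric round-to-nearest with one shared scale for the group."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical weights for one output neuron and one calibration activation vector.
weights = [0.17, -0.80, 0.45, 0.27]
activations = [10.0, 1.0, 1.0, 1.0]      # channel 0 carries large activations
exact = dot(weights, activations)

# Plain round-to-nearest.
rtn_err = abs(dot(quantize_group(weights), activations) - exact)

# Activation-aware scaling: multiply the salient weight by s, divide its activation by s.
s = [2.0, 1.0, 1.0, 1.0]                 # hand-picked scales, an assumption of this sketch
scaled_q = quantize_group([w * si for w, si in zip(weights, s)])
scaled_x = [x / si for x, si in zip(activations, s)]
awq_err = abs(dot(scaled_q, scaled_x) - exact)

print(f"round-to-nearest error: {rtn_err:.4f}")
print(f"with channel scaling:   {awq_err:.4f}")
```

The scaled weight lands on a finer part of the grid relative to its size, and since the group's largest weight is unchanged, the other channels are quantized exactly as before.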

AWQ advantages:

The autoawq library provides easy-to-use AWQ quantization.

GGUF (llama.cpp)

GGUF (GPT-Generated Unified Format) is the format used by llama.cpp for CPU and mixed CPU/GPU inference. It supports various quantization levels:

Format    Bits/Weight   Quality        Speed
Q8_0      8.5           Near-perfect   Fast
Q6_K      6.6           Excellent      Fast
Q5_K_M    5.7           Very good      Good
Q4_K_M    4.8           Good           Good
Q3_K_M    3.9           Acceptable     Moderate
Q2_K      2.9           Poor           Slow

Q4_K_M is the most popular choice — it offers the best balance of quality, size, and speed for most use cases.

Model Formats: GGUF, safetensors

Different frameworks use different file formats for quantized models:

GGUF

Used by llama.cpp, Ollama, LM Studio, and other CPU-friendly tools. GGUF files are self-contained — they include the model weights, tokenizer, and metadata in a single file. Great for local inference.

safetensors + config

Used by Hugging Face transformers, AutoGPTQ, and AWQ. The model is split across multiple files (model shards + config). This format is used for GPU inference with PyTorch-based frameworks.
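If you end up with a model file and are not sure which format it is, the first few bytes tell you. A GGUF file starts with the ASCII magic `GGUF`. A safetensors file starts with an 8-byte little-endian length followed by a JSON header of that length. The helper below is a minimal sketch built on those two facts. The function name is hypothetical, and the demo uses tiny stand-in files rather than real models.

```python
import json
import os
import struct
import tempfile

def sniff_model_format(path):
    """Guess whether a file is GGUF or safetensors from its first bytes."""
    with open(path, "rb") as f:
        head = f.read(8)
        if head[:4] == b"GGUF":
            return "gguf"
        if len(head) == 8:
            (header_len,) = struct.unpack("<Q", head)
            # Only try to parse if the claimed header fits inside the file.
            if 0 < header_len <= os.path.getsize(path) - 8:
                try:
                    json.loads(f.read(header_len))
                    return "safetensors"
                except ValueError:
                    pass
    return "unknown"

# Demo with tiny stand-in files that contain only the leading bytes.
with tempfile.TemporaryDirectory() as tmp:
    gguf_path = os.path.join(tmp, "tiny.gguf")
    with open(gguf_path, "wb") as f:
        f.write(b"GGUF" + struct.pack("<I", 3))           # magic + version
    st_path = os.path.join(tmp, "tiny.safetensors")
    header = json.dumps({"__metadata__": {}}).encode()
    with open(st_path, "wb") as f:
        f.write(struct.pack("<Q", len(header)) + header)  # length + JSON header
    detected = [sniff_model_format(gguf_path), sniff_model_format(st_path)]

print(detected)
```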

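Which to Choose?

Choose GGUF for local inference with llama.cpp-based tools such as Ollama and LM Studio, especially on CPU or mixed CPU/GPU setups. Choose safetensors when you run GPTQ or AWQ models on a GPU with PyTorch-based frameworks.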

Practical Guide

Option 1: Use Pre-Quantized Models

The easiest approach is to download models that are already quantized. On Hugging Face, search for "GGUF" or "GPTQ" versions of popular models. For Ollama:

# Download and run the default tag (a 4-bit quantized build)
ollama run llama3:8b

# Or pick a specific quantization by tag
ollama run llama3:8b-instruct-q5_K_M

Option 2: Quantize Yourself with GPTQ

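from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Calibration data: a list of tokenized samples (input_ids + attention_mask).
# One sentence keeps the example short; use a few hundred representative texts in practice.
calibration_dataset = [
    tokenizer("Quantization reduces the numerical precision of a model's weights.")
]

quantize_config = BaseQuantizeConfig(
    bits=4,             # target bit width
    group_size=128,     # one scale per 128 weights
    damp_percent=0.01,  # Hessian dampening for numerical stability
)

model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)
model.quantize(calibration_dataset)
model.save_quantized("./llama3-8b-gptq")
tokenizer.save_pretrained("./llama3-8b-gptq")  # keep the tokenizer next to the weights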

Option 3: Create GGUF with llama.cpp

# Convert the Hugging Face model to an unquantized (F16) GGUF
python convert_hf_to_gguf.py /path/to/model --outtype f16 --outfile model-f16.gguf

# Then quantize the F16 GGUF to Q4_K_M
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M

Quality vs Speed Trade-offs

The key question when choosing a quantization level is: how much quality am I willing to lose?

General Guidelines

When to Use Higher Precision

Some tasks are more sensitive to quantization than others. Complex reasoning tasks, for example, may degrade noticeably at 4-bit.

Rule of thumb: if your model is 7B parameters, try Q5_K_M or Q4_K_M. For 13B+, Q4_K_M is usually the sweet spot. For 70B, Q3_K_M or Q4_K_M can be worth it if it's the only way to fit the model.
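You can turn this rule of thumb into a quick check: estimate each format's weight size from its bits per weight and pick the highest-precision one that fits your memory. The sketch below uses the approximate bits-per-weight values from the GGUF table in this article. The 10% headroom reserved for the KV cache and runtime overhead is an arbitrary assumption, and the function name is hypothetical.

```python
# Approximate bits per weight, taken from the GGUF table in this article.
BITS_PER_WEIGHT = {
    "Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7,
    "Q4_K_M": 4.8, "Q3_K_M": 3.9, "Q2_K": 2.9,
}

def pick_format(num_params, memory_gb, headroom=0.10):
    """Return the highest-precision format whose weights fit, or None if none do."""
    budget_gb = memory_gb * (1 - headroom)   # leave room for KV cache and overhead
    for name, bits in sorted(BITS_PER_WEIGHT.items(), key=lambda kv: -kv[1]):
        if num_params * bits / 8 / 1e9 <= budget_gb:
            return name
    return None

print(pick_format(8e9, 8))     # 8B model, 8 GB of memory
print(pick_format(70e9, 48))   # 70B model, 48 GB of memory
print(pick_format(70e9, 24))   # 70B model, 24 GB of memory
```

A result of `None` means even the smallest format listed does not fit, so you would need a smaller model or more memory. Long contexts need more headroom than the default used here.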

Frequently Asked Questions

What is LLM quantization?

LLM quantization is the process of reducing the precision of a model's weights from 16-bit or 32-bit floating point numbers to lower precision formats like 8-bit or 4-bit integers. This reduces memory usage and can speed up inference, allowing large models to run on smaller hardware.

Does quantization reduce model quality?

Yes, quantization typically causes some quality degradation, but the impact varies by method and level. INT8 quantization usually has negligible quality loss. INT4 quantization may cause noticeable degradation on complex reasoning tasks. Advanced methods like GPTQ and AWQ minimize quality loss through careful calibration.

What is the difference between GPTQ and AWQ?

Both are post-training quantization methods, most often used at INT4. GPTQ uses approximate second-order information to compensate for rounding error as it quantizes each layer. AWQ (Activation-aware Weight Quantization) identifies the most important weight channels from activation patterns and protects them by rescaling before quantization. AWQ is often reported to be faster and to give slightly better quality at the same bit width, though results vary by model and inference kernel.

How much VRAM do I need for a quantized model?

Rough rule: a 7B model needs ~14GB at FP16, ~7GB at INT8, and ~4GB at INT4. A 13B model needs ~26GB at FP16, ~13GB at INT8, and ~7GB at INT4. A 70B model needs ~140GB at FP16, ~70GB at INT8, and ~35GB at INT4. These figures cover the weights only; the KV cache, activations, and framework overhead come on top, so actual usage varies by implementation and context length.