LLM Inference Optimization

The techniques that decide whether your model serves at 20 tokens/sec or 200. Attention kernels, KV cache compression, prefix reuse, and speculative decoding — the full stack of optimizations that ship with modern inference engines.

Three bottlenecks, three families of fixes

LLM inference has three recurring bottlenecks, and this cluster covers the optimization techniques for each. Attention is memory-bound and, computed naively, needs memory quadratic in sequence length; Flash Attention fixes this with tiled kernels that never materialize the full N×N score matrix. The KV cache dominates VRAM at long context; quantization (FP8/INT4) and compression (MLA) shrink it, and prefix caching reuses it across requests that share a prefix. The decode loop is serial and underutilizes the GPU; speculative decoding lets a draft model propose several tokens that the target model then verifies in a single pass.

How these compose

These optimizations are not alternatives; they stack. A production serving stack running vLLM or SGLang uses Flash Attention for the attention kernel, quantizes the KV cache to fit longer context, enables automatic prefix caching to skip redundant prefill, and adds speculative decoding to speed up the decode phase. Combined, they can make the difference between baseline serving and a stack that runs roughly 5x faster. Each technique is compatible with the others and with every engine in our inference cluster.

Flash Attention Explained

The tiled attention kernel that cut attention memory from O(N²) to O(N), with FlashAttention-3 delivering a 1.5-2x speedup over FlashAttention-2 on H100. How FA1, FA2, and FA3 differ.

Deep Dive
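
To make the "never materialize the full N×N matrix" idea concrete, here is a minimal pure-Python sketch of the online-softmax trick behind tiled attention, for a single query vector. It illustrates the math only and is not the Flash Attention kernel itself; the function names and the `block` parameter are hypothetical.

```python
import math

def naive_attention(q, K, V):
    """Reference: compute every score first, then softmax, then the weighted sum."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, V)) / z for j in range(len(V[0]))]

def tiled_attention(q, K, V, block=2):
    """Process keys/values one block at a time, keeping only running statistics."""
    scale = 1.0 / math.sqrt(len(q))
    m = float("-inf")          # running maximum of the scores seen so far
    l = 0.0                    # running softmax denominator
    acc = [0.0] * len(V[0])    # running (unnormalized) weighted sum of values
    for start in range(0, len(K), block):
        kb, vb = K[start:start + block], V[start:start + block]
        s = [scale * sum(a * b for a, b in zip(q, k)) for k in kb]
        m_new = max(m, max(s))
        correction = math.exp(m - m_new)   # rescales old statistics to the new max
        p = [math.exp(x - m_new) for x in s]
        l = l * correction + sum(p)
        acc = [a * correction + sum(p_i * v[j] for p_i, v in zip(p, vb))
               for j, a in enumerate(acc)]
        m = m_new
    return [a / l for a in acc]
```

Only one block of scores exists at any time, so the working memory does not grow with the number of keys, yet the result matches the naive version up to floating-point error.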

Prefix Caching in vLLM & SGLang

How automatic prefix caching skips redundant prefill — block-level hashing (vLLM) vs radix tree (SGLang), and when each wins.

Deep Dive
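
As a rough illustration of block-level hashing, the sketch below chains each block's hash to its parent's, so a block only matches when everything before it matches too. This is a simplified model under assumed names (`PrefixCache`, `BLOCK_SIZE`, `block_hashes`), not vLLM's actual data structures.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per block (illustrative value)

def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Hash each full block together with the hash of the block before it."""
    hashes = []
    parent = b""
    full = len(tokens) - len(tokens) % block_size  # a trailing partial block is skipped
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        payload = parent + b"|" + ",".join(map(str, block)).encode()
        parent = hashlib.sha256(payload).digest()
        hashes.append(parent)
    return hashes

class PrefixCache:
    def __init__(self):
        self.blocks = set()  # hashes of blocks whose KV entries are assumed cached

    def insert(self, tokens):
        self.blocks.update(block_hashes(tokens))

    def cached_prefix_len(self, tokens):
        """Number of leading tokens whose prefill could be skipped."""
        n = 0
        for h in block_hashes(tokens):
            if h not in self.blocks:
                break
            n += BLOCK_SIZE
        return n
```

A request that shares its first two blocks with an earlier one would skip prefill for those eight tokens; a request that differs in its first token gets no reuse, because every later hash depends on the first block.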

Speculative Decoding

Draft-verify decoding that can deliver 2-3x faster decoding without changing the target model's outputs when verification is exact. Covers Medusa, EAGLE, and production deployment.

Deep Dive
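
The draft-verify loop can be sketched with toy deterministic "models" and greedy decoding. Everything here is hypothetical (`target_next`, `draft_next`, the token arithmetic); a real engine verifies all proposed positions in one batched forward pass of the target model, which this sketch only counts.

```python
def target_next(ctx):
    """Toy stand-in for the large model's greedy next token."""
    return (sum(ctx) * 31 + len(ctx)) % 50

def draft_next(ctx):
    """Toy draft model: agrees with the target except at every 4th position."""
    tok = target_next(ctx)
    return (tok + 1) % 50 if len(ctx) % 4 == 0 else tok

def greedy_decode(prompt, n):
    out = list(prompt)
    for _ in range(n):
        out.append(target_next(out))
    return out[len(prompt):]

def speculative_decode(prompt, n, k=3):
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < n:
        # 1. The draft model proposes k tokens autoregressively.
        proposal, ctx = [], list(out)
        for _ in range(k):
            tok = draft_next(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2. The target checks every proposed position (one pass in a real engine).
        target_passes += 1
        accepted = []
        for tok in proposal:
            expected = target_next(out + accepted)
            if tok != expected:
                accepted.append(expected)  # replace the first mismatch and stop
                break
            accepted.append(tok)
        else:
            accepted.append(target_next(out + accepted))  # bonus token when all match
        out.extend(accepted)
    return out[len(prompt):len(prompt) + n], target_passes
```

Because every emitted token is the target's own greedy choice for its prefix, the output is identical to plain greedy decoding, while the number of target passes is smaller than the number of tokens whenever the draft guesses well.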

MLA Attention (DeepSeek)

Multi-Head Latent Attention: DeepSeek's KV cache compression, which cuts KV cache memory by about 93% (the reduction DeepSeek-V2 reports relative to DeepSeek 67B) while keeping quality.

Deep Dive

Ring Attention

Scaling context to millions of tokens across GPUs — sequence parallelism and the ring topology for long-context serving.

Deep Dive
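
The ring topology can be pictured as each device keeping its own query block while key/value blocks rotate to the next neighbor at every step. The sketch below only models that rotation schedule; `ring_schedule` is a hypothetical helper, and the rotation direction is an arbitrary choice.

```python
def ring_schedule(num_devices):
    """schedule[step][device] = index of the KV block that device holds at that step.

    Each device starts with its own block; at every step, blocks move one
    position around the ring.
    """
    return [
        [(device - step) % num_devices for device in range(num_devices)]
        for step in range(num_devices)
    ]

if __name__ == "__main__":
    for step, holdings in enumerate(ring_schedule(4)):
        print(f"step {step}: {holdings}")
```

After as many steps as there are devices, every device has attended over every KV block exactly once, without any single device ever storing the whole sequence.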

KV Cache Quantization (FP8/INT4)

How FP8 and INT4 KV cache quantization cut memory 2x and 4x respectively relative to FP16, often with under 1% quality drop. Implementation in vLLM and SGLang.

Deep Dive
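
To show where the memory saving comes from, here is a minimal sketch of symmetric per-group INT4 quantization in pure Python. It is a generic scheme chosen for illustration, not the specific method used by vLLM or SGLang; the function names and `group_size` are assumptions.

```python
def quantize_int4(values, group_size=4):
    """Map each group of floats to integer codes in [-7, 7] plus one scale per group."""
    codes, scales = [], []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        scale = max(abs(v) for v in group) / 7 or 1.0  # avoid a zero scale
        scales.append(scale)
        codes.extend(round(v / scale) for v in group)
    return codes, scales

def dequantize_int4(codes, scales, group_size=4):
    """Recover approximate floats; the error per value is at most half a scale step."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]
```

Each value shrinks from 16 bits to 4, at the cost of storing one scale per group, so smaller groups trade memory for accuracy.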

KV Cache Calculator

Compute the exact KV cache memory footprint for any model — plan your GPU memory budget before deploying.

Tool
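
The arithmetic behind such a calculator is short enough to sketch. The function below uses the standard sizing formula for a transformer KV cache (keys plus values, per layer, per KV head); `kv_cache_bytes` and its parameter names are hypothetical, and the example dimensions are illustrative rather than taken from the tool.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """KV cache size in bytes: 2 tensors (K and V) per layer, per token."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)

if __name__ == "__main__":
    # Illustrative: 32 layers, 8 KV heads of dimension 128, 8192 tokens, FP16.
    size = kv_cache_bytes(32, 8, 128, 8192)
    print(f"{size / 2**30:.2f} GiB")
```

Halving `bytes_per_elem` (for example FP16 to FP8) halves the footprint, and the total scales linearly with both context length and batch size.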

Quantization Calculator

Compare model size and memory across FP16, BF16, INT8, and INT4 — see exactly how much you save before deploying.

Tool
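
A back-of-the-envelope version of this comparison is just parameters times bits per weight. The sketch below counts weights only and ignores quantization overhead such as scales and zero points; `weight_bytes` and the `BITS` table are hypothetical names.

```python
BITS = {"FP16": 16, "BF16": 16, "INT8": 8, "INT4": 4}

def weight_bytes(num_params, fmt):
    """Approximate weight storage in bytes for a given numeric format."""
    return num_params * BITS[fmt] // 8

if __name__ == "__main__":
    for fmt in BITS:
        print(f"{fmt}: {weight_bytes(7_000_000_000, fmt) / 1e9:.1f} GB")
```

Real deployments also need memory for the KV cache, activations, and runtime overhead, so this figure is a lower bound on the GPU memory budget.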

Chunked Prefill for Long Prompts

Splitting prefill into chunks to interleave with decode — better latency for long inputs.

Coming soon
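
Until the full article lands, the core idea can be sketched as a per-step token budget shared between running decodes and one chunk of a long prompt. This is a simplified model assuming decode tokens are scheduled first; `schedule_steps` and its parameters are hypothetical, not an engine API.

```python
def schedule_steps(prompt_len, num_decoding, token_budget):
    """Return (prefill_tokens, decode_tokens) per step until the prompt is prefilled.

    Each running decode request consumes one token of the budget per step;
    whatever budget remains goes to the next chunk of the long prompt.
    """
    if token_budget <= num_decoding:
        raise ValueError("token_budget must leave room for prefill")
    steps = []
    remaining = prompt_len
    while remaining > 0:
        chunk = min(remaining, token_budget - num_decoding)
        steps.append((chunk, num_decoding))
        remaining -= chunk
    return steps
```

With a 1000-token prompt, 8 running decodes, and a budget of 264 tokens per step, the prompt is prefilled over four steps while every decode request keeps producing a token each step, instead of stalling behind one large prefill.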

Related clusters