LLM Inference Optimization

Three bottlenecks, three families of fixes

LLM inference has three recurring bottlenecks, and this cluster covers the optimization techniques for each. Attention is memory-bound and quadratic in sequence length: Flash Attention cuts the memory traffic with tiled kernels that never materialize the full N×N score matrix, although the compute itself stays quadratic. The KV cache dominates VRAM at long context: quantization (FP8/INT4) and compression (MLA) shrink it, and prefix caching reuses it across requests that share a prompt prefix. The decode loop is serial and underutilizes the GPU: speculative decoding lets a small draft model propose several tokens that the target model then verifies in a single forward pass.
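
To make the KV cache claim concrete, the arithmetic can be sketched as below. The function name `kv_cache_bytes` and the model shape (32 layers, 8 KV heads, head dimension 128) are illustrative assumptions, not taken from a specific model or library, and the estimate ignores the small extra storage that quantization scales need.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_element=2.0):
    """Estimate the bytes needed to cache keys and values for every layer and token."""
    # The leading 2 accounts for two cached tensors per layer: keys and values.
    elements = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return int(elements * bytes_per_element)


GIB = 1024 ** 3

# Hypothetical model shape, chosen only to keep the numbers round.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128)

# Element widths in bytes: FP16 = 2, FP8 = 1, INT4 = 0.5.
for name, width in [("FP16", 2.0), ("FP8", 1.0), ("INT4", 0.5)]:
    size = kv_cache_bytes(seq_len=32_768, bytes_per_element=width, **shape)
    print(f"{name}: {size / GIB:.1f} GiB for one 32K-token sequence")
```

Under these assumptions a single 32K-token sequence needs 4 GiB at FP16, 2 GiB at FP8, and 1 GiB at INT4. The cost scales linearly with both sequence length and batch size, which is why the cache, rather than the weights, often becomes the limit at long context.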

How these compose

These optimizations are not alternatives; they stack. A production serving stack running vLLM or SGLang uses Flash Attention for the attention kernel, quantizes the KV cache to fit longer context, enables automatic prefix caching to skip redundant prefill, and adds speculative decoding to speed up the decode phase. The combined effect is what separates a 5x speedup from baseline serving. Each technique is compatible with the others and with every engine in our inference cluster.
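
Of the pieces in that stack, speculative decoding is the least intuitive, so a toy sketch of the draft-then-verify loop may help. Everything here is hypothetical: `toy_target` and `toy_draft` are stand-in functions rather than real models, the loop uses greedy decoding only (sampling-based variants use a rejection-sampling acceptance rule instead), and it is not how vLLM or SGLang implement the feature.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # greedy "model": context -> next token id


def greedy_decode(model: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Baseline: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new_tokens: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (tokens, number of verification passes)."""
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    verify_passes = 0
    while len(tokens) < limit:
        # 1. The cheap draft model proposes k tokens, one after another.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks every proposed position. A real engine scores all
        #    k + 1 positions in one batched forward pass; this toy calls it serially.
        verify_passes += 1
        accepted: List[int] = []
        for i in range(k + 1):
            expected = target(tokens + proposal[:i])
            accepted.append(expected)  # always the target's own token
            # Stop at the first disagreement, or after the bonus token at i == k.
            if i == k or expected != proposal[i]:
                break
        tokens.extend(accepted)
    return tokens[:limit], verify_passes


def toy_target(ctx: List[int]) -> int:
    # Hypothetical deterministic stand-in for the large model.
    return (sum(ctx) * 31 + 7) % 50


def toy_draft(ctx: List[int]) -> int:
    # Hypothetical draft: agrees with the target except when len(ctx) % 5 == 0.
    guess = toy_target(ctx)
    return guess if len(ctx) % 5 else (guess + 1) % 50


prompt = [3, 14, 15]
out, passes = speculative_decode(toy_target, toy_draft, prompt, max_new_tokens=20)
assert out == greedy_decode(toy_target, prompt, 20)  # same output as plain greedy
print(f"{passes} verification passes for 20 new tokens")
```

The output always matches plain greedy decoding with the target, because every token that is kept is the target's own choice. The gain is that each verification pass can commit several tokens at once, so the better the draft agrees with the target, the fewer passes the large model needs.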

Flash Attention Explained

Prefix Caching in vLLM & SGLang

Speculative Decoding

MLA Attention (DeepSeek)

Ring Attention

KV Cache Quantization (FP8/INT4)

KV Cache Calculator

Quantization Calculator

Chunked Prefill for Long Prompts

Related clusters