Understand How Large Language Models Think
No math PhD needed. Interactive 3D visualizations make complex LLM concepts intuitive.
15+ free lessons from tokenization to speculative decoding. Click, explore, and learn at your own pace.
Start from Scratch →
Your Learning Journey
Follow the recommended order, or jump to any topic
1. Understand BPE, WordPiece, and SentencePiece: how text becomes tokens for LLMs.
2. Visualize how tokens become vectors in high-dimensional space and why similarity matters.
3. Explore the complete Transformer architecture in 3D: Encoder, Decoder, Attention, and data flow.
4. Multi-Head, Grouped Query, Sliding Window, and Flash Attention: the evolution of attention.
5. How KV caching accelerates autoregressive inference, plus the PagedAttention optimization.
6. Pre-training, supervised fine-tuning, and the complete training loop with loss landscapes.
7. Low-Rank Adaptation: parameter-efficient fine-tuning for large models.
8. Reinforcement Learning from Human Feedback: aligning LLMs with human preferences.
9. Retrieval-Augmented Generation: grounding LLM outputs in external knowledge sources.
10. Sparse expert routing: how Mixtral, DeepSeek, and reportedly GPT-4 scale to huge capacity with roughly constant per-token compute.
11. FP32, INT8, INT4 & GGUF: how low-bit quantization shrinks models 4-8× to run on a single GPU.
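As a small taste of lesson 11, here is a minimal sketch of where the 4-8× figure comes from. It assumes plain symmetric (absmax) quantization with a single scale for the whole list; real formats such as GGUF store per-block scales and other metadata, so their actual savings are somewhat smaller. The function names and sample weights are illustrative, not taken from any library.

```python
def quantize_absmax(values, bits=8):
    """Symmetric (absmax) quantization of floats to signed integers."""
    qmax = 2 ** (bits - 1) - 1              # 127 for INT8, 7 for INT4
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map integers back to approximate float values."""
    return [x * scale for x in q]


def compression_ratio(src_bits, dst_bits):
    """Ideal size reduction, ignoring scales and other metadata."""
    return src_bits / dst_bits


weights = [0.5, -1.0, 0.25, 0.0]            # hypothetical sample weights
q8, s8 = quantize_absmax(weights, bits=8)
restored = dequantize(q8, s8)

print(q8, restored)
print(compression_ratio(32, 8), compression_ratio(32, 4))   # FP32 -> INT8, FP32 -> INT4
```

Each restored value differs from the original by at most half a quantization step, which is why moderate bit widths tend to preserve quality reasonably well.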
Interactive Tools
Free calculators and side-by-side comparisons — bookmark these and come back whenever you need them.
Calculator: Pick any LLM and context length, then instantly see the VRAM needed for the KV cache. Supports MHA, GQA, MQA, and DeepSeek's MLA.
Comparison: Side-by-side comparison of the four attention mechanisms, covering KV cache savings, quality tradeoffs, and a decision guide for which to use.
Calculator: How big is a 70B model in INT4? Enter any model and precision to see the disk size and which GPUs can run it.
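The arithmetic behind both calculators can be sketched in a few lines. This is a back-of-the-envelope version, not the site's implementation: the function names and the example configuration (32 layers, head dimension 128, 8 KV heads) are assumptions chosen for illustration, the KV formula covers MHA, GQA, and MQA but not MLA's compressed cache, and the weight estimate ignores quantization metadata.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch_size=1):
    """KV cache size: one K and one V tensor per layer.

    MHA uses n_kv_heads == number of query heads, GQA uses fewer,
    and MQA uses n_kv_heads == 1.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * bytes_per_elem * batch_size)


def weight_bytes(n_params, bits_per_param):
    """Approximate size of the weights alone at a given precision."""
    return n_params * bits_per_param / 8


# Hypothetical model: 32 layers, head_dim 128, FP16 cache, 8192-token context.
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=8192)
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=8192)
print(gqa / 2**30, "GiB with 8 KV heads (GQA)")
print(mha / 2**30, "GiB with 32 KV heads (MHA)")

# "How big is a 70B model in INT4?"
print(weight_bytes(70_000_000_000, 4) / 1e9, "GB")
```

Under these assumptions, cutting the KV heads from 32 to 8 shrinks the cache by 4×, and a 70B model in INT4 needs roughly 35 GB for its weights.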
Go Deeper
Frontier techniques from recent research papers, illustrated in 3D. For readers who've mastered the fundamentals — or anyone curious about what's next.
Deep Dive: How DeepSeek compresses the KV cache into a single latent vector per token, for a 9× memory reduction.
Deep Dive: FP8 and INT4 KV cache quantization, halving serving memory without quality collapse.
Deep Dive: A tiny draft model guesses tokens and the large model verifies them, giving 2-3× faster decoding with zero quality loss.
Deep Dive: How million-token contexts are split across GPUs in a ring. Powers Llama 3 & Gemini.
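To make the draft-and-verify idea concrete, here is a toy sketch of speculative decoding under greedy decoding only. The two "models" are hypothetical stand-in functions over integer tokens, not real networks, and the sketch omits the bonus token and the rejection-sampling step that production systems use for sampled decoding. It illustrates why the output matches the large model's own greedy output while needing fewer verification rounds.

```python
def speculative_greedy(target_next, draft_next, prompt, n_new, k=4):
    """Greedy speculative decoding over lists of integer tokens.

    target_next / draft_next map a token list to the next token.
    Returns (tokens, number_of_verification_rounds).
    """
    tokens = list(prompt)
    goal = len(prompt) + n_new
    rounds = 0
    while len(tokens) < goal:
        # 1. The draft model proposes up to k tokens, one at a time.
        proposal = []
        for _ in range(min(k, goal - len(tokens))):
            proposal.append(draft_next(tokens + proposal))
        # 2. The target model checks every proposed position. A real system
        #    scores all of them in one batched forward pass; here we loop.
        rounds += 1
        for i, guess in enumerate(proposal):
            expected = target_next(tokens + proposal[:i])
            if guess != expected:
                # Keep the accepted prefix plus the target's correction.
                tokens.extend(proposal[:i])
                tokens.append(expected)
                break
        else:
            tokens.extend(proposal)      # every guess was accepted
    return tokens, rounds


def target_next(tokens):
    """Hypothetical 'large model': a deterministic toy rule."""
    return (sum(tokens) * 31 + len(tokens)) % 50


def draft_next(tokens):
    """Hypothetical 'draft model': agrees with the target most of the time."""
    guess = target_next(tokens)
    return (guess + 1) % 50 if len(tokens) % 5 == 0 else guess


prompt, n_new = [1, 2, 3], 12
out, rounds = speculative_greedy(target_next, draft_next, prompt, n_new)

# Reference: ordinary greedy decoding with the target model alone.
ref = list(prompt)
for _ in range(n_new):
    ref.append(target_next(ref))

print(out == ref, rounds, "verification rounds for", n_new, "tokens")
```

Because every emitted token is either confirmed or supplied by the target model, the result is identical to decoding with the target alone; the speedup comes from verifying several positions per round.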