Open Source · Interactive · Free Forever

Understand How Large Language Models Think

No math PhD needed. Interactive 3D visualizations make complex LLM concepts intuitive.

15+ free lessons from tokenization to speculative decoding. Click, explore, and learn at your own pace.

Start from Scratch →

Explore Modules

Tokenization

Understand BPE, WordPiece, and SentencePiece: how text becomes tokens for LLMs.

📐

Embedding Layer

Visualize how tokens become vectors in high-dimensional space and why similarity matters.

🔄

Transformer Architecture

Explore the complete Transformer architecture in 3D — Encoder, Decoder, Attention, and data flow.

👁️

Attention Variants

Multi-Head, Grouped Query, Sliding Window, and Flash Attention — the evolution of attention.

⚡

KV Cache

How KV caching accelerates autoregressive inference, and how PagedAttention manages cache memory in fixed-size pages.

🏋️

Training Pipeline

Pre-training, supervised fine-tuning, and the complete training loop with loss landscapes.

🔧

LoRA Fine-Tuning

Low-Rank Adaptation: parameter-efficient fine-tuning for large models.

🎯

RLHF Alignment

Reinforcement Learning from Human Feedback — aligning LLMs with human preferences.

📚

RAG

Retrieval-Augmented Generation — grounding LLM outputs in external knowledge sources.

🧠

Mixture of Experts

Sparse expert routing: how Mixtral and DeepSeek (and reportedly GPT-4) grow total capacity while keeping per-token compute roughly constant.

🗜️

Quantization

FP32, INT8, INT4 & GGUF: how low-bit quantization shrinks models 4-8× relative to FP32 so they can run on a single GPU.

Calculator

KV Cache Memory Calculator

Pick any LLM and context length — instantly see the VRAM needed for the KV cache. Supports MHA, GQA, MQA, and DeepSeek's MLA.

12 models · 4 precisions
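The arithmetic behind a calculator like this can be sketched in a few lines. This is a minimal sketch, not the site's implementation: the function name and the example configuration (32 layers, head dimension 128, 4,096-token context, FP16) are illustrative assumptions, and the formula covers MHA, GQA, and MQA only. MLA caches a compressed latent instead of full keys and values, so it needs a different formula.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len,
                   bytes_per_elem=2, batch_size=1):
    """Estimate KV cache size in bytes for MHA/GQA/MQA-style attention.

    Hypothetical helper for illustration; the leading factor 2 accounts
    for storing one key tensor and one value tensor per layer.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * context_len * bytes_per_elem * batch_size)


# Assumed example config: 32 layers, head_dim 128, 4096 tokens, FP16 (2 bytes).
GIB = 1024 ** 3
mha = kv_cache_bytes(32, n_kv_heads=32, head_dim=128, context_len=4096)
gqa = kv_cache_bytes(32, n_kv_heads=8, head_dim=128, context_len=4096)
mqa = kv_cache_bytes(32, n_kv_heads=1, head_dim=128, context_len=4096)

print(f"MHA: {mha / GIB:.4f} GiB")  # 32 KV heads
print(f"GQA: {gqa / GIB:.4f} GiB")  # 8 KV heads, 4x smaller than MHA
print(f"MQA: {mqa / GIB:.4f} GiB")  # 1 KV head, 32x smaller than MHA
```

Under these assumptions the cache scales linearly with the number of KV heads, which is why moving from MHA to GQA or MQA shrinks it by the head-sharing ratio.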

Comparison

MHA vs MQA vs GQA vs MLA

Side-by-side comparison of the four attention mechanisms — KV cache savings, quality tradeoffs, and a decision guide for which to use.

Interactive chart

Calculator

LLM Model Size Calculator

How big is a 70B model in INT4? Enter any model and precision — see the disk size and which GPUs can run it.

14 models · GPU fit
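The 70B-in-INT4 question reduces to parameters times bits per weight. The sketch below is a rough estimate under stated assumptions, not the site's calculator: the function names are hypothetical, the 20% runtime overhead factor is an assumed placeholder, and real quantized files carry extra metadata such as per-block scales.

```python
# Bits per weight for each precision (weights only, no quantization metadata).
BITS_PER_WEIGHT = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4}


def model_size_gb(n_params, precision):
    """Approximate weight size in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BITS_PER_WEIGHT[precision] / 8 / 1e9


def fits_on_gpu(n_params, precision, vram_gb, overhead=1.2):
    """Rough fit check; `overhead` is an assumed allowance for
    activations and KV cache, not a measured value."""
    return model_size_gb(n_params, precision) * overhead <= vram_gb


size = model_size_gb(70e9, "INT4")
print(f"70B in INT4: about {size:.0f} GB of weights")
print("Fits in 24 GB:", fits_on_gpu(70e9, "INT4", 24))
print("Fits in 80 GB:", fits_on_gpu(70e9, "INT4", 80))
```

With these numbers a 70B model in INT4 needs about 35 GB for weights alone, so it does not fit a 24 GB card but does fit an 80 GB one.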

Deep Dive

Multi-Head Latent Attention (MLA)

How DeepSeek compresses the KV cache into a single latent vector per token, for a 9× memory reduction.

Advanced · DeepSeek-V2/V3

Deep Dive

KV Cache Quantization

FP8 and INT4 KV cache quantization: cutting KV cache memory by 2-4× relative to FP16 without quality collapse.

Intermediate · FP8 · INT4 · KIVI

Deep Dive

Speculative Decoding

A small draft model proposes several tokens and the large model verifies them in one pass: typically 2-3× faster, with the same output distribution as the large model alone.

Intermediate · Draft · Verify · Medusa

Deep Dive

Ring Attention

How million-token contexts are split across GPUs arranged in a ring. A form of the context parallelism used to train long-context models such as Llama 3.

Advanced · Context Parallelism

View all articles →

Your Learning Journey

Interactive Tools

Go Deeper