The Scaling Problem

To make language models smarter, you add more parameters. A 7-billion-parameter model generally beats a 1-billion one; a 70-billion model generally beats the 7-billion. This relationship is remarkably consistent: capability grows with scale. But there is a catch: every extra parameter costs compute. A dense model must run every parameter for every token it generates, so doubling the parameters roughly doubles the inference cost per token and, on the same hardware, roughly halves the generation speed.

This creates a painful trade-off. You want the capacity of a trillion-parameter model, but you cannot afford the trillion-parameter compute at inference time. Training is expensive but happens once; inference happens on every single request, millions of times per day. Mixture of Experts (MoE) breaks this trade-off by making compute conditional — only a fraction of the model runs for each token.

What Is a Mixture of Experts?

In a standard Transformer, each Feed-Forward Network (FFN) layer is a single block of neurons that processes every token. In a Mixture of Experts layer, that single FFN is replaced by several parallel FFNs called experts. A small neural network called the router (or gate) looks at each token and decides which experts should process it.

The key idea is sparse activation: for each token, only the top-k experts (often k=2) are activated. A model with 8 experts and top-2 routing has 8× the parameters of a single FFN, but only runs 2/8 = 25% of them per token. You get the capacity of a huge model at a fraction of the compute. This is called conditional computation — different tokens take different paths through the network.
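The arithmetic in the paragraph above can be checked with a few lines of Python. This is a minimal sketch: the function name `moe_ffn_budget` and the 100M-parameter FFN size are illustrative assumptions, and the small cost of the router itself is ignored.

```python
def moe_ffn_budget(ffn_params: int, num_experts: int, top_k: int) -> dict:
    """Compare stored vs. per-token-active FFN parameters for one MoE layer."""
    total = num_experts * ffn_params   # every expert is stored
    active = top_k * ffn_params        # only the selected experts run per token
    return {
        "total_params": total,
        "active_params": active,
        "active_fraction": active / total,
        "capacity_vs_single_ffn": total / ffn_params,
        "compute_vs_single_ffn": active / ffn_params,
    }

# Hypothetical layer: each expert is a 100M-parameter FFN, 8 experts, top-2 routing.
budget = moe_ffn_budget(ffn_params=100_000_000, num_experts=8, top_k=2)
print(budget["active_fraction"])         # 0.25
print(budget["capacity_vs_single_ffn"])  # 8.0
print(budget["compute_vs_single_ffn"])   # 2.0
```

Note the last figure: with top-2 routing the layer does about twice the work of one expert-sized FFN, while storing eight times the parameters.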

The Router: A Tiny Network with Big Responsibility

The router is deceptively simple. It is a single linear layer that maps the token's hidden state to a score for each expert, followed by a softmax. Then it picks the top-k experts with the highest scores. The token is sent only to those experts, and their outputs are combined weighted by the router scores.

G(x) = Softmax(x · W_g)                 # scores for all N experts
top_k = TopK(G(x), k)                   # pick k highest scores
y = Σ_{i ∈ top_k} G(x)_i · E_i(x)       # weighted sum of expert outputs
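The routing formula can be sketched in plain Python as follows. Everything here is a toy assumption made for illustration: the helper names `softmax`, `route`, and `moe_forward`, the 2-dimensional hidden state, and the "experts" that merely scale their input. A real layer would use learned FFNs and batched tensor operations.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route(x, W_g, k):
    """Return [(expert_index, gate_weight), ...] for the k highest-scoring experts."""
    num_experts = len(W_g[0])
    # logits[e] = sum_d x[d] * W_g[d][e], i.e. the product x · W_g
    logits = [sum(x[d] * W_g[d][e] for d in range(len(x))) for e in range(num_experts)]
    gates = softmax(logits)
    top = sorted(range(num_experts), key=lambda e: gates[e], reverse=True)[:k]
    return [(e, gates[e]) for e in top]

def moe_forward(x, W_g, experts, k=2):
    """y = sum over the selected experts of G(x)_i * E_i(x)."""
    y = [0.0] * len(x)
    for e, gate in route(x, W_g, k):
        out = experts[e](x)  # only the selected experts are ever called
        y = [yi + gate * oi for yi, oi in zip(y, out)]
    return y

# Toy setup: hidden size 2, 4 experts; expert i just scales its input by (i + 1).
experts = [lambda v, s=i + 1: [s * t for t in v] for i in range(4)]
W_g = [[0.0, 1.0, 2.0, 3.0],
       [0.0, 0.0, 0.0, 0.0]]
x = [1.0, 0.0]

selected = route(x, W_g, k=2)
y = moe_forward(x, W_g, experts, k=2)
print([e for e, _ in selected])  # [3, 2]
```

As in the formula, the gate weights of the selected experts are used as-is and therefore sum to less than 1. Some implementations renormalize the weights over the selected experts instead; that is a design variation, not what this sketch does.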

Despite its simplicity, the router is the heart of an MoE model. A bad routing decision wastes capacity — an expert specialized in code gets a math token and produces noise. Training the router well is what makes or breaks an MoE model. This is why techniques like load balancing (keeping all experts busy) and router noise during training are so important.

Expert Routing in 3D

The visualization for this section shows a single MoE layer. A token enters at the top, the router computes scores for all 8 experts, and energy beams flow to the selected top-k experts. Lowering k leaves more experts idle for that token, while the layer as a whole still holds the full model capacity.

Token (coral) → router (gold) → 8 experts. Bright beams = active experts; dimmed blocks = idle.

Sparse vs Dense: Compute at a Glance

The contrast between dense and sparse activation is the whole point of MoE. On the left, a dense FFN activates every neuron for every token: full capacity, full cost. On the right, an 8-expert MoE with top-2 routing activates only the two selected experts. The total parameter count is larger, but the active parameters per token are far fewer. This is how Mixtral 8×7B (47B total parameters) gets by with roughly the per-token compute of a 13B dense model.

Left: dense FFN, all neurons active (coral). Right: 8-expert MoE, top-2, only the selected experts glow (cyan). With 2 of 8 experts active, the MoE layer does about 4× less FFN compute per token than it would if all eight experts ran.

Load Balancing: Keeping All Experts Busy

A naive MoE has a fatal flaw: routing collapse. The router quickly learns that a few experts work well and sends everything to them. The other experts starve, receive no gradient, and never improve. The model effectively shrinks back to a small dense model — wasting all those extra parameters.

The solution is an auxiliary load-balancing loss. During training, the model is penalized when the fraction of tokens routed to each expert is uneven. This gently pushes the router to distribute tokens uniformly across experts. Most MoE models also add small noise to the router logits during training to encourage exploration. Together, these techniques ensure every expert specializes and contributes.

L_aux = N · Σ_{i=1}^{N} f_i · P_i
# f_i = fraction of tokens routed to expert i
# P_i = average router probability for expert i
# N = number of experts
# Minimized when all f_i are equal → balanced.
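To see how this loss behaves, the sketch below evaluates it on two tiny hand-made batches of router probabilities. The function name `load_balance_loss` and the example batches are assumptions for illustration; f_i is computed here from top-k assignments and normalized so that the f_i sum to 1, which is one common convention rather than the only one.

```python
def load_balance_loss(router_probs, k=1):
    """router_probs: one softmax row per token. Returns N * sum_i f_i * P_i."""
    num_tokens = len(router_probs)
    n = len(router_probs[0])
    counts = [0] * n
    for row in router_probs:
        # experts this token would actually be dispatched to
        top = sorted(range(n), key=lambda e: row[e], reverse=True)[:k]
        for e in top:
            counts[e] += 1
    f = [c / (num_tokens * k) for c in counts]                            # f_i
    p = [sum(row[e] for row in router_probs) / num_tokens for e in range(n)]  # P_i
    return n * sum(fi * pi for fi, pi in zip(f, p))

# Balanced batch: each of the 4 tokens prefers a different expert.
balanced = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
# Collapsed batch: every token prefers expert 0.
collapsed = [[0.7, 0.1, 0.1, 0.1] for _ in range(4)]

loss_balanced = load_balance_loss(balanced)    # about 1.0
loss_collapsed = load_balance_loss(collapsed)  # about 2.8
print(loss_balanced, loss_collapsed)
```

The collapsed batch scores noticeably higher, which is the pressure that pushes the router back toward an even spread. In training, this term is typically scaled by a small coefficient and added to the main loss.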

Expert Specialization

A well-trained MoE develops emergent specialization. Without anyone telling it to, different experts learn to handle different types of tokens. Analyses of trained models report experts that favor particular syntactic constructs, code tokens, or specific languages. The Mixtral authors, notably, reported that routing tracked syntax more clearly than topic, so the specialization is often finer-grained than "one expert per subject". Either way, the router learns this routing implicitly from the data, with no manual assignment needed.

Each expert (colored cluster) is shown with the kinds of tokens a trained router tends to send to it.

MoE in Practice: Mixtral, DeepSeek & GPT-4

MoE powers many of the most capable models today. Each pushes the technique in a different direction, revealing what works at scale.

Mixtral 8×7B — The Open Standard

Released by Mistral AI in late 2023, Mixtral has 8 experts per layer with top-2 routing. Only about 13 billion of its 47 billion total parameters are active per token, so its per-token compute is close to that of a 13-billion dense model. Mistral AI reports that it matches or beats Llama 2 70B across most benchmarks while using roughly 5× fewer active parameters per token. Mixtral proved that sparse MoE is practical for open-source deployment, not just proprietary labs.
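A common question is why "8×7B" comes to about 47B rather than 56B. The rough count below suggests an answer: only the FFN experts are replicated, while attention and embeddings are shared. The architecture constants are assumptions taken from the publicly released Mixtral configuration, not from the text above, and small terms such as normalization weights are ignored.

```python
# Assumed architecture values (from the published model config, not from this article).
HIDDEN = 4096        # hidden size
FFN_DIM = 14336      # expert FFN inner size
LAYERS = 32
EXPERTS = 8
TOP_K = 2
HEADS, KV_HEADS, HEAD_DIM = 32, 8, 128   # grouped-query attention
VOCAB = 32000

expert = 3 * HIDDEN * FFN_DIM   # gated FFN: three weight matrices per expert
attention = 2 * HIDDEN * HEADS * HEAD_DIM + 2 * HIDDEN * KV_HEADS * HEAD_DIM  # q, o + k, v
router = HIDDEN * EXPERTS       # one linear layer per MoE block
embeddings = 2 * VOCAB * HIDDEN # input embedding + output head

shared = LAYERS * (attention + router) + embeddings   # used by every token
total = shared + LAYERS * EXPERTS * expert            # all experts stored
active = shared + LAYERS * TOP_K * expert             # only top-k experts run

print(f"total  = {total / 1e9:.1f}B")   # total  = 46.7B
print(f"active = {active / 1e9:.1f}B")  # active = 12.9B
```

Under these assumptions the totals land on the 47B and 13B figures quoted above, with the experts accounting for the large majority of stored parameters.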

DeepSeekMoE — Fine-Grained Experts

DeepSeek took a different approach: instead of 8 large experts, use many small experts (e.g., 64 experts with top-6 routing). Smaller experts specialize more finely and combine more flexibly. DeepSeek also reserves some "shared experts" that activate on every token to capture common knowledge, so the routed experts do not each have to relearn it. DeepSeek reports that this fine-grained design achieves better quality per parameter than coarse MoE.
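One way to make "combine more flexibly" concrete is to count how many distinct expert subsets a single token can be routed to. The sketch below uses the two configurations mentioned in this article; it only counts combinations and makes no claim that the two setups have equal compute or equal quality.

```python
import math

# Number of distinct expert subsets available to one token.
coarse = math.comb(8, 2)    # 8 experts, top-2
fine = math.comb(64, 6)     # 64 experts, top-6

print(coarse)  # 28
print(fine)    # 74974368
```

With millions of possible combinations instead of a few dozen, the router has far more ways to mix small pieces of specialized capacity for each token.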

GPT-4 — MoE at the Frontier

OpenAI has not confirmed GPT-4's architecture. Unverified reports describe it as a massive MoE with roughly 1.8 trillion total parameters, of which only about 280 billion (~15%) are active per token; accounts differ on the expert layout (8 and 16 experts have both been reported). If those reports are accurate, this is how GPT-4 would achieve frontier capability while remaining servable: the vast majority of its parameters stay dormant on any given request.

The MoE Trade-offs

MoE is not free. While inference compute per token stays low, the total memory footprint is large — all experts must reside in memory even when idle. A 47B-parameter MoE needs the VRAM of a 47B model even though it computes like a 13B one. This is why MoE models are often quantized (see the Quantization module) for consumer hardware.
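The memory point can be made concrete with back-of-the-envelope arithmetic. This sketch counts weight storage only, using the 47B figure from the text; the helper name `weight_memory_gb` is hypothetical, 1 GB is taken as 10^9 bytes, and the KV cache, activations, and any quantization metadata are ignored.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

TOTAL_PARAMS = 47e9  # all experts must be resident, even the idle ones

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(TOTAL_PARAMS, bits):.1f} GB")
# 16-bit weights: 94.0 GB
#  8-bit weights: 47.0 GB
#  4-bit weights: 23.5 GB
```

At 16 bits the weights alone exceed the memory of any single consumer GPU, which is why 4-bit and similar quantization schemes are so commonly applied to MoE models.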

MoE models can also be harder to train — routing instability, load-balancing tuning, and expert under-utilization are real engineering challenges. And the sparse structure complicates distributed serving: experts must be sharded across GPUs carefully to avoid communication bottlenecks. The table below summarizes the trade-offs.

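Aspect	Dense Model	Sparse MoE
Compute / token	100% (all params active)	~15-25% of params active (top-k)
Memory footprint	Proportional to the params it computes with	Proportional to total params (all experts loaded)
Total capacity	Limited by per-token compute	8-16× more FFN parameters than a single FFN
Training difficulty	Relatively stable	Load balancing and router tuning needed
Serving complexity	Simple	Expert sharding across devices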
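The "expert sharding" row can be illustrated with a toy load count. The sketch below assumes a hypothetical round-robin placement of experts onto devices and a hand-written routing result; real serving systems use more sophisticated placement and also have to account for the cost of moving tokens between devices.

```python
from collections import Counter

def device_loads(token_experts, num_experts, num_devices):
    """Count expert invocations per device for a batch of routed tokens.

    token_experts: one tuple of selected expert indices per token.
    """
    # Hypothetical placement: expert e lives on device e % num_devices.
    placement = {e: e % num_devices for e in range(num_experts)}
    loads = Counter({d: 0 for d in range(num_devices)})
    for selected in token_experts:
        for e in selected:
            loads[placement[e]] += 1
    return dict(loads)

# Four tokens, top-2 routing, 8 experts spread over 4 devices.
routing = [(0, 1), (0, 2), (0, 4), (1, 4)]
loads = device_loads(routing, num_experts=8, num_devices=4)
print(loads)  # {0: 5, 1: 2, 2: 1, 3: 0}
```

Because experts 0 and 4 are both popular and share device 0, that device handles five of the eight expert calls while device 3 sits idle. Uneven routing therefore becomes uneven hardware utilization, which is one reason load balancing matters at serving time as well as during training.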

Key Takeaways

MoE replaces a single FFN with N parallel experts and a router that activates only the top-k per token — giving huge capacity at a fraction of the compute.

The router is a tiny softmax layer whose TopK decision controls the path each token takes. Training the router well is the central challenge.

Without a load-balancing auxiliary loss, MoE suffers routing collapse — a few experts hog all tokens while the rest starve and never learn.

Experts develop emergent specialization (code, math, specific languages) learned implicitly from data — no manual assignment.

MoE trades cheap per-token compute for a large memory footprint: Mixtral 8×7B computes like a 13B model but must hold all 47B parameters in memory, which is why it is often paired with quantization.
