How Do LLMs Learn?
A large language model is, at its core, a giant mathematical function with billions of parameters. When you first create this function, the parameters are random — the model outputs gibberish. Training is the process of adjusting these parameters so that the model produces meaningful, coherent text. It sounds simple, but the engineering required to make it work at scale is extraordinary.
The training pipeline has three major phases. First, pre-training: the model learns the structure of language by predicting the next token across trillions of words. Second, fine-tuning: the model is refined on curated instruction data to follow directions and adopt a specific style. Third, alignment: techniques like RLHF shape the model's behavior to be helpful, harmless, and honest. This page focuses on the first two phases — the mathematical foundations that make them work.
Pre-training: Predicting the Next Token
The pre-training objective is beautifully simple: given a sequence of tokens, predict the next one. That is it. The model reads "The cat sat on the" and tries to predict "mat." It reads "The capital of France is" and tries to predict "Paris." By doing this across trillions of tokens — web pages, books, code, Wikipedia — the model learns grammar, facts, reasoning patterns, and even a degree of common sense.
Why does predicting the next token work so well? Because to predict well, the model must understand context. To predict that "bank" follows "I went to the river," the model needs to know that river banks exist. To predict that "2" follows "What is 1 + 1?", the model needs arithmetic. The diversity and scale of training data is what gives LLMs their broad capabilities.
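To make the objective concrete, here is a minimal sketch of how one sentence turns into training examples. Splitting on whitespace is an assumption made for brevity (real models use subword tokenizers), and `next_token_examples` is a hypothetical helper, not part of any library.

```python
def next_token_examples(tokens):
    """Return (context, target) pairs: every prefix is used to predict the token after it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# Whitespace splitting stands in for a real tokenizer (assumption for brevity).
tokens = "The cat sat on the mat".split()
examples = next_token_examples(tokens)

for context, target in examples:
    print(" ".join(context), "->", target)
```

A single six-token sentence already yields five prediction problems, which is part of why raw text is such a rich training signal.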
Think of it like this
Imagine you are filling in blanks on a massive exam. "The sky is ___." "Water freezes at ___ degrees Celsius." "In Shakespeare's Hamlet, the protagonist's best friend is ___." To fill in these blanks correctly, you need a working knowledge of the entire world. The model learns about the world because that is the only way to get the answers right.
The Loss Function: Measuring Mistakes
Every time the model predicts the next token, we need a way to measure how wrong it was. This is the loss function. For language models, the standard choice is cross-entropy loss. It compares the model's predicted probability distribution over the vocabulary with the actual next token.
For a single prediction, the loss is the negative natural logarithm of the probability the model assigned to the correct token. If the model assigns a probability of 1.0 to the correct token, the loss is 0 (perfect). If the model assigns a probability of 0.01, the loss is about 4.6 (very wrong). The loss is always non-negative, and lower is better. During training, we average this loss over all tokens in a batch to get a single number that summarizes the model's performance.
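A minimal sketch of that calculation, assuming natural logarithms (which is what makes a probability of 0.01 come out near 4.6). The three-token vocabulary and the function names are hypothetical.

```python
import math

def cross_entropy(probs, target_index):
    """Loss for one prediction: negative natural log of the probability of the correct token."""
    return -math.log(probs[target_index])

def mean_loss(batch):
    """Average the per-token losses over a batch of (probs, target_index) pairs."""
    return sum(cross_entropy(probs, target) for probs, target in batch) / len(batch)

confident = [1.0, 0.0, 0.0]   # all probability mass on token 0
unsure = [0.01, 0.50, 0.49]   # only 1% on token 0

print(f"confident: {abs(cross_entropy(confident, 0)):.3f}")
print(f"unsure:    {cross_entropy(unsure, 0):.3f}")
print(f"batch:     {mean_loss([(confident, 0), (unsure, 0)]):.3f}")
```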
Cross-entropy loss has a useful property: it is differentiable. This means we can compute the gradient (how the loss changes in response to a small change in each parameter) and use that information to adjust the parameters. Without differentiability, we would have no way to systematically improve the model.
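One way to see differentiability in action is to compare an exact gradient with a numerical estimate. The sketch below treats the model's raw output scores (logits) as the quantities being differentiated, which is a simplification: in a real model the gradient is propagated further back to the weights. It relies on the standard result that, for softmax followed by cross-entropy, the gradient with respect to the logits is the predicted probabilities minus the one-hot target.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def loss(logits, target):
    return -math.log(softmax(logits)[target])

def analytic_grad(logits, target):
    """Exact gradient of the loss with respect to each logit: softmax minus one-hot."""
    probs = softmax(logits)
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]

def numeric_grad(logits, target, eps=1e-6):
    """Central finite-difference estimate of the same gradient."""
    grads = []
    for i in range(len(logits)):
        up, down = list(logits), list(logits)
        up[i] += eps
        down[i] -= eps
        grads.append((loss(up, target) - loss(down, target)) / (2 * eps))
    return grads

logits, target = [2.0, 0.5, -1.0], 1   # illustrative values
exact = analytic_grad(logits, target)
approx = numeric_grad(logits, target)
print(exact)
print(approx)
```

The two lists should agree to several decimal places; the entry for the correct token is negative, meaning that raising its logit lowers the loss.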
Gradient Descent: How Weights Are Updated
Once we have the loss, we compute the gradient of the loss with respect to every parameter using backpropagation. The gradient tells us the direction of steepest ascent — how to increase the loss. We go in the opposite direction to decrease it. This is gradient descent: θ ← θ − η · ∇L, where θ is the parameters, ∇L is the gradient, and η is the learning rate.
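The update rule can be sketched on a toy one-parameter loss. The loss L(θ) = (θ − 3)², the starting point, and the learning rate are all illustrative assumptions chosen so the minimum (θ = 3) is known in advance.

```python
def grad(theta):
    """Gradient of the toy loss L(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)

def gradient_descent(theta, lr, steps):
    """Repeatedly apply theta <- theta - lr * grad(theta)."""
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta

theta_final = gradient_descent(theta=0.0, lr=0.1, steps=100)
print(theta_final)
```

Each step moves θ against the gradient, so it closes a fixed fraction of the remaining distance to the minimum. A real model does the same thing simultaneously for billions of parameters, with the gradient supplied by backpropagation instead of a hand-written formula.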
The learning rate is critical. Too high, and the model overshoots good solutions, bouncing around the loss landscape. Too low, and training takes forever, creeping along imperceptibly. Modern training uses learning rate schedules (a warmup phase that starts small and ramps up to a peak, followed by gradual decay) and adaptive optimizers like AdamW that adjust the effective learning rate per parameter based on gradient history.
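A schedule of this kind can be sketched in a few lines. Linear warmup followed by cosine decay is one common choice and is an assumption here; the text does not say which decay shape is used, and the peak value and step counts below are purely illustrative.

```python
import math

def learning_rate(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay toward min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Illustrative settings: 100 warmup steps out of 1,000 total.
schedule = [learning_rate(s, peak_lr=3e-4, warmup_steps=100, total_steps=1000)
            for s in range(1000)]

for s in (0, 50, 99, 500, 999):
    print(s, f"{schedule[s]:.2e}")
```

The small early steps keep the randomly initialized model from making wild updates, and the decay lets it settle as training ends.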
Imagine you are blindfolded on a hilly terrain and trying to find the lowest valley. You feel the slope under your feet (the gradient) and take a step downhill. The size of your step is the learning rate. Take too big a step and you might jump over the valley entirely. Take too small a step and you will never reach the bottom.
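The step-size trade-off in this analogy shows up even on a toy problem. The sketch below runs the same number of gradient descent steps on the assumed loss L(θ) = (θ − 3)² with three illustrative learning rates and reports how far each run ends from the minimum.

```python
def final_distance(lr, steps=50, start=0.0, target=3.0):
    """Run gradient descent on L(theta) = (theta - target)^2; return |theta - target|."""
    theta = start
    for _ in range(steps):
        theta -= lr * 2.0 * (theta - target)
    return abs(theta - target)

for lr in (0.001, 0.1, 1.1):
    print(f"lr={lr}: distance from minimum after 50 steps = {final_distance(lr):.3g}")
```

With the smallest rate the walker has barely moved, with the middle one it has essentially arrived, and with the largest each step overshoots by more than the last, so the distance grows instead of shrinking.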
The Loss Landscape
The interactive 3D visualization below shows what the loss landscape looks like — a surface where height represents the loss value. The optimizer (the glowing sphere) rolls downhill, searching for the lowest point (the minimum). In reality, the loss landscape has billions of dimensions, but this 3D projection captures the essential intuition.
The optimizer sphere rolls downhill toward lower loss.
Notice that the landscape is not a smooth bowl. It has multiple valleys (local minima), ridges, and plateaus. This is why optimization is hard — the optimizer can get stuck in a suboptimal valley. Techniques like momentum, learning rate scheduling, and stochastic mini-batches help the optimizer escape shallow valleys and find deeper, better minima.
The Training Pipeline
The diagram below shows the complete training loop. Data flows through the model (forward pass), the loss is computed, gradients flow backwards (backpropagation), and the optimizer updates the weights. This cycle repeats millions of times. Each cycle processes a batch of token sequences, and over time the model's predictions become increasingly accurate.
Data → Tokenizer → Model → Loss → Backprop → Update. The loop repeats millions of times.
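The whole loop fits in a short script if the model is made tiny. In this sketch the "model" is a bigram lookup table (one logit for every pair of current and next word), which is a stand-in assumption for a real transformer; the corpus, the whitespace tokenizer, and the learning rate are likewise illustrative.

```python
import math

# Data -> Tokenizer: map each word to an integer id.
text = "the cat sat on the mat the cat sat on the mat".split()
vocab = sorted(set(text))
index = {word: i for i, word in enumerate(vocab)}
ids = [index[w] for w in text]
pairs = list(zip(ids[:-1], ids[1:]))          # (current token, next token)

# Model: weights[current][next] is the logit for "next follows current".
V = len(vocab)
weights = [[0.0] * V for _ in range(V)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_step(lr):
    grads = [[0.0] * V for _ in range(V)]
    total_loss = 0.0
    for cur, nxt in pairs:
        probs = softmax(weights[cur])          # forward pass
        total_loss += -math.log(probs[nxt])    # cross-entropy loss
        for j in range(V):                     # backprop: dL/dlogit = prob - one_hot
            grads[cur][j] += probs[j] - (1.0 if j == nxt else 0.0)
    n = len(pairs)
    for i in range(V):                         # update: gradient descent on the mean loss
        for j in range(V):
            weights[i][j] -= lr * grads[i][j] / n
    return total_loss / n

losses = [train_step(lr=1.0) for _ in range(200)]
print(f"first loss: {losses[0]:.3f}, last loss: {losses[-1]:.3f}")
```

The first loss equals the log of the vocabulary size, because an untrained model spreads its probability evenly. The loss cannot reach zero here: "the" is followed by both "cat" and "mat", so some uncertainty is built into the data.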
Scaling Laws: Bigger Is Better (If Done Right)
One of the most important empirical findings in deep learning is that model performance improves predictably as you increase three things: model size (number of parameters), data size (number of training tokens), and compute (total FLOPs). This relationship follows a power law: each 10x increase in compute multiplies the loss by a roughly constant factor, so the curve appears as a straight line on a log-log plot.
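What "power law" means in practice can be shown with a toy formula. The constants `a` and `b` below are hypothetical values chosen for illustration, not fitted to any real model, and the formula ignores the irreducible loss floor that fitted scaling laws usually include.

```python
def loss_from_compute(compute, a=10.0, b=0.05):
    """Hypothetical power law: loss = a * compute^(-b). Constants are illustrative only."""
    return a * compute ** (-b)

# Ratio of losses for each successive 10x increase in compute.
ratios = [loss_from_compute(10.0 ** (k + 1)) / loss_from_compute(10.0 ** k)
          for k in range(18, 24)]

for k, r in zip(range(18, 24), ratios):
    print(f"1e{k} -> 1e{k + 1} FLOPs: loss multiplied by {r:.4f}")
```

Every row shows the same multiplier, which is the defining property of a power law: the relative improvement per 10x is constant, while the absolute improvement shrinks as the loss gets smaller.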
The Chinchilla paper (Hoffmann et al., 2022) showed that most large models at the time were under-trained. GPT-3, with 175 billion parameters, was trained on 300 billion tokens. But the optimal ratio is approximately 20 tokens per parameter, meaning a 175B model should be trained on roughly 3.5 trillion tokens. DeepMind's Chinchilla (70B parameters, 1.4 trillion tokens) outperformed the much larger Gopher (280B) at a comparable compute budget, simply by using more data with a smaller model.
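The arithmetic behind these figures is easy to check. The sketch below uses the approximate 20-tokens-per-parameter ratio quoted above; the compute estimate of about 6 FLOPs per parameter per token is a widely used rule of thumb and is an assumption here, not something stated in the text.

```python
TOKENS_PER_PARAM = 20  # approximate compute-optimal ratio quoted in the text

def optimal_tokens(params):
    return TOKENS_PER_PARAM * params

def training_flops(params, tokens):
    # Rule-of-thumb assumption: roughly 6 FLOPs per parameter per training token.
    return 6 * params * tokens

gpt3_params, gpt3_tokens = 175e9, 300e9
chinchilla_params, chinchilla_tokens = 70e9, 1.4e12

print(f"GPT-3 tokens per parameter:      {gpt3_tokens / gpt3_params:.2f}")
print(f"Chinchilla tokens per parameter: {chinchilla_tokens / chinchilla_params:.2f}")
print(f"Optimal tokens for a 175B model: {optimal_tokens(gpt3_params):.2e}")
print(f"Estimated Chinchilla FLOPs:      {training_flops(chinchilla_params, chinchilla_tokens):.2e}")
```

By this ratio GPT-3 saw fewer than 2 tokens per parameter, roughly a tenth of the compute-optimal amount.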
The Chinchilla Insight
For a fixed compute budget, the optimal strategy is to train a smaller model on more data, not to build the biggest possible model and under-train it. This finding reshaped how the industry trains models. LLaMA, for example, pushes the idea further: its smaller models are trained on far more tokens than the Chinchilla-optimal ratio, spending extra training compute to get a model that is cheaper to run at inference time.
Scaling Law Visualization
The plot below illustrates the scaling law relationship. The x-axis shows model size on a log scale; the y-axis shows loss. The compute-optimal curve (solid line) shows the best achievable loss for each model size when trained with the optimal amount of data. The dashed lines show what happens when you deviate — either under-training (not enough data) or wasting compute (over-training).
Model size vs. loss — the power-law relationship that governs LLM training efficiency
Fine-Tuning: From General to Specific
Pre-training produces a model that can predict the next token across any domain. But we usually want the model to do something more specific: follow instructions, answer questions, or write code. This is where fine-tuning comes in. The most common approach is Supervised Fine-Tuning (SFT): we curate a dataset of (instruction, response) pairs and continue training the model on this data.
Fine-tuning adjusts a relatively small fraction of the model's knowledge. The pre-trained model already knows how to write coherent English, reason about problems, and recall facts. Fine-tuning teaches it the format of interactions — how to respond to a user's request, when to say "I don't know," and how to structure its output. It is like teaching a knowledgeable person how to be a good teaching assistant, rather than teaching them the subject matter from scratch.
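A single SFT training example might be assembled roughly as follows. The prompt template, the `<eos>` marker, and the choice to compute the loss only on response tokens are all assumptions for illustration (the last is a common convention, but the text does not specify it), and whitespace splitting again stands in for a real tokenizer.

```python
def build_sft_example(instruction, response):
    """Join an (instruction, response) pair and mark which tokens count toward the loss."""
    # Hypothetical template; real datasets each define their own format.
    prompt_tokens = f"### Instruction: {instruction} ### Response:".split()
    response_tokens = response.split() + ["<eos>"]
    tokens = prompt_tokens + response_tokens
    loss_mask = [0] * len(prompt_tokens) + [1] * len(response_tokens)
    return tokens, loss_mask

def masked_mean_loss(per_token_losses, loss_mask):
    """Average the loss over response tokens only."""
    kept = [loss for loss, keep in zip(per_token_losses, loss_mask) if keep]
    return sum(kept) / len(kept)

tokens, loss_mask = build_sft_example("Name the capital of France.", "Paris.")
for token, keep in zip(tokens, loss_mask):
    print(f"{keep}  {token}")
```

The training objective is still next-token prediction. What changes is the data, and under this masking convention the model is graded only on how it responds, not on reproducing the instruction.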
Distributed Training: Scaling Across GPUs
Training a model with billions of parameters on trillions of tokens requires thousands of GPUs working together. No single GPU has enough memory to hold the model, its gradients, and the optimizer states. Distributed training solves this by splitting the work across multiple devices. The three main strategies are shown in the diagram below.
Three parallelism strategies, often combined in practice: Data Parallelism (same model, different data), Tensor Parallelism (split one layer across GPUs), and Pipeline Parallelism (different layers on different GPUs)
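A back-of-the-envelope estimate shows why one device is not enough. The byte counts below assume mixed-precision training with an Adam-style optimizer, and the 80 GB device size is also an assumption; activations and temporary buffers are ignored, so the real requirement is higher.

```python
import math

# Assumed per-parameter storage for mixed-precision training with an Adam-style optimizer.
BYTES_PER_PARAM = {
    "fp16 weights": 2,
    "fp16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 optimizer momentum": 4,
    "fp32 optimizer variance": 4,
}

def training_memory_gb(num_params):
    """Memory for weights, gradients, and optimizer states (activations excluded)."""
    return num_params * sum(BYTES_PER_PARAM.values()) / 1e9

gpu_memory_gb = 80                      # assumed accelerator size
needed_gb = training_memory_gb(175e9)   # a 175B-parameter model
min_gpus = math.ceil(needed_gb / gpu_memory_gb)

print(f"{needed_gb:.0f} GB needed, at least {min_gpus} GPUs just to hold the training state")
```

Under these assumptions the training state alone takes 16 bytes per parameter, so the model has to be split across devices before any data is processed.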
Data parallelism gives each GPU a complete copy of the model but a different slice of the data. Gradients are averaged across GPUs after each step. Tensor parallelism splits individual weight matrices across GPUs, so each GPU holds a portion of each layer. Pipeline parallelism assigns different layers to different GPUs, forming an assembly line. In practice, the largest training runs typically combine all three strategies. ZeRO (Zero Redundancy Optimizer) further reduces memory by sharding optimizer states, gradients, and parameters across GPUs.
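Data parallelism works because averaging per-device gradients reproduces the gradient of the full batch. The sketch below simulates two "GPUs" as plain Python lists; the one-parameter model y = w·x and the data points are illustrative assumptions, and the shards must be equal in size for a simple average to match exactly.

```python
def gradient(w, batch):
    """Gradient of mean squared error for the one-parameter model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
w = 0.5

# Each simulated device holds the same weights but an equal-sized shard of the batch.
shards = [data[0:2], data[2:4]]
local_grads = [gradient(w, shard) for shard in shards]

# Stand-in for the all-reduce step: average the per-device gradients.
averaged = sum(local_grads) / len(local_grads)
full_batch = gradient(w, data)

print(local_grads, averaged, full_batch)
```

Because every device applies the same averaged gradient, all copies of the model stay identical after each step.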
Training Costs: The Price of Intelligence
Training a frontier LLM is one of the most computationally intensive tasks ever undertaken. GPT-3's training run was estimated at roughly $4.6 million in compute costs, using thousands of GPUs for weeks. GPT-4 is estimated to have cost over $100 million. The energy consumption is equally staggering: GPT-3's training run used an estimated 1,287 MWh, roughly equivalent to the annual electricity consumption of 120 average American homes.
These costs have several implications. First, only well-funded organizations can train frontier models from scratch, which raises questions about AI democratization. Second, efficiency innovations like LoRA (which we cover in the next module), quantization, and mixture-of-experts architectures are not just nice-to-have — they are economic necessities. Third, the environmental impact of training is non-trivial, and the industry is increasingly focused on renewable energy for data centers.
Key Takeaways
LLMs learn by predicting the next token across massive text corpora — a simple objective that requires understanding language and the world.
Cross-entropy loss measures prediction quality, and gradient descent systematically adjusts billions of parameters to minimize this loss.
Scaling laws show that performance improves predictably with more parameters, more data, and more compute — the Chinchilla paper showed that data scaling is as important as model scaling.
Fine-tuning adapts a pre-trained model to specific tasks using curated instruction data, without re-learning the basics of language.
Distributed training techniques (data, tensor, pipeline parallelism) and memory optimizations (ZeRO) make training billion-parameter models feasible.