Why Alignment Matters

A pre-trained language model is essentially a sophisticated text predictor. Given a prompt, it continues with the tokens that are most probable under its training data. This is powerful: it can write essays, code, and poetry. But it has no built-in notion of what is helpful, harmless, or honest. Ask it to explain quantum physics and it might respond with a clear explanation, a conspiracy theory, or simply refuse, depending on which continuation its training data makes most likely.

This is the alignment problem: how do we make a model that not only generates fluent text, but actually follows instructions, refuses harmful requests, and stays truthful? The answer, popularized by OpenAI in the InstructGPT paper (2022) and later used to train ChatGPT, is Reinforcement Learning from Human Feedback (RLHF). The approach builds on earlier research on learning from human preferences, and it is one of the most important techniques behind turning raw language models into the helpful assistants we use today.

The RLHF Pipeline Overview

RLHF is not a single algorithm — it is a three-stage pipeline. Each stage builds on the previous one, progressively shaping the model's behavior to align with human preferences.


Each stage serves a distinct purpose: SFT teaches the model the basic skill of instruction-following, the reward model turns human preferences into a learned scalar scoring function, and PPO uses that function to fine-tune the model's behavior. Let us walk through each stage in detail.

Stage 1: Supervised Fine-Tuning (SFT)

We start with the base pre-trained model — a raw text predictor. The first step is to teach it the basic skill of following instructions. We do this by fine-tuning on a curated dataset of (instruction, response) pairs, where each response is written or verified by a human to be high-quality.
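To make the fine-tuning objective concrete, here is a minimal sketch of the per-example SFT loss: the average negative log-likelihood of the human-written response tokens. It assumes the model has already produced a log-probability for each target token, and it assumes prompt tokens are masked out of the loss, which is a common choice but not something stated above. The names `sft_loss`, `token_logprobs`, and `is_response_token` are illustrative only.

```python
import math

def sft_loss(token_logprobs, is_response_token):
    """Mean negative log-likelihood over response tokens only.

    token_logprobs: log-probability the model assigned to each target token.
    is_response_token: True where the token belongs to the response
    (assumption: prompt tokens are excluded from the loss).
    """
    picked = [-lp for lp, keep in zip(token_logprobs, is_response_token) if keep]
    if not picked:
        raise ValueError("example contains no response tokens")
    return sum(picked) / len(picked)

# Toy example: 2 prompt tokens followed by 3 response tokens.
logprobs = [math.log(0.9), math.log(0.8),
            math.log(0.5), math.log(0.25), math.log(0.5)]
mask = [False, False, True, True, True]
loss = sft_loss(logprobs, mask)
print(round(loss, 4))
```

Minimizing this quantity over the curated dataset pushes the model toward assigning high probability to the demonstrated responses.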

What makes a good SFT dataset?

Quality matters far more than quantity. The original InstructGPT paper used roughly 13,000 instruction-response pairs for SFT, a tiny dataset by ML standards. Each example was carefully crafted by human labelers to demonstrate the desired output format, tone, and level of detail. A few thousand excellent examples can be more useful than a far larger set of mediocre ones.

After SFT, the model can follow instructions reasonably well — it will respond in a helpful format, stay on topic, and generally produce coherent answers. But it is still not fully aligned: it might be overly verbose, give unsafe advice, or fail to acknowledge uncertainty. The next two stages address these remaining issues.

The pre-trained model learns to follow instructions by training on prompt-response pairs — its outputs gradually align with human-written responses

Stage 2: Reward Model Training

The reward model is the bridge between human judgment and automated optimization. The idea is elegant: instead of asking humans to evaluate every single model output (which is far too expensive at scale), we train a model to predict human preferences.

Here is how it works. For a given prompt, the SFT model generates multiple responses (typically 4 to 9). Human labelers rank these responses from best to worst based on criteria like helpfulness, truthfulness, and safety. These rankings become training data for the reward model: a separate neural network (often initialized from the SFT model) that takes a (prompt, response) pair and outputs a scalar reward score. The reward model is trained using the Bradley-Terry preference model, which treats the probability that humans prefer y₁ over y₂ as P(y₁ ≻ y₂ | x) = σ(r(x, y₁) − r(x, y₂)), where σ is the logistic sigmoid. Maximizing the likelihood of the observed comparisons pushes the model to assign r(x, y₁) > r(x, y₂) whenever humans preferred y₁.
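The pairwise objective described above can be sketched in a few lines. This is a simplified illustration, not the InstructGPT training code: it takes reward scores as plain floats and assumes a ranking of K responses is expanded into all K·(K−1)/2 ordered pairs. The function names are hypothetical.

```python
import math
from itertools import combinations

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def pairwise_loss(reward_preferred, reward_rejected):
    """Bradley-Terry negative log-likelihood for one human comparison."""
    return -log_sigmoid(reward_preferred - reward_rejected)

def ranking_loss(rewards_best_to_worst):
    """Average pairwise loss over every pair implied by one ranking.

    The list is ordered by the human ranking, so in each pair produced by
    combinations() the first element is the preferred response.
    """
    pairs = list(combinations(rewards_best_to_worst, 2))
    return sum(pairwise_loss(hi, lo) for hi, lo in pairs) / len(pairs)

# Reward scores that agree with the human ranking give a lower loss
# than the same scores in reversed order.
scores = [2.1, 0.7, 0.3, -1.0]
good = ranking_loss(scores)
bad = ranking_loss(list(reversed(scores)))
print(good < bad)
```

The loss depends only on the difference between the two scores, so the reward model learns a relative scale rather than absolute values.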

How the Reward Model Learns

The diagram below illustrates the reward model training process. A prompt goes to the model, which generates multiple responses. Humans rank these responses, and the reward model learns to assign scores that reflect human preferences.

The reward model learns to score preferred responses higher — it is trained on thousands of human preference comparisons.

Stage 3: PPO Optimization

With the reward model in hand, we now optimize the LLM's behavior using Proximal Policy Optimization (PPO) — a reinforcement learning algorithm. The LLM (called the "policy" in RL terms) generates responses, the reward model scores them, and PPO updates the policy to maximize the expected reward.

PPO is chosen for its stability. Unlike vanilla policy gradient methods, which can make destructively large updates, PPO constrains how much the policy can change in a single step. This is crucial because language models are sensitive: a small change in weights can drastically alter output quality. PPO's clipped objective removes the incentive to push the probability ratio between the updated and previous policy outside a narrow band around 1, so each update step stays conservative and the model remains on a stable improvement trajectory.
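A minimal sketch of the clipped surrogate objective for a single sampled action may help here. It assumes an advantage estimate is already available (how it is computed is not covered above) and uses a clip range of 0.2, a common default rather than a value taken from this article. The function name is illustrative.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO clipped objective for one sample (to be maximized).

    logp_new / logp_old: log-probability of the sampled action under the
    updated policy and under the policy that generated the sample.
    advantage: how much better the action was than expected (assumed given).
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum makes the objective a pessimistic bound: once the
    # ratio leaves [1 - eps, 1 + eps] in the direction that would increase
    # the objective, moving further yields no additional gain.
    return min(ratio * advantage, clipped_ratio * advantage)

# Positive advantage: gains are capped once the ratio exceeds 1.2.
print(clipped_surrogate(math.log(1.5), 0.0, advantage=1.0))
# Negative advantage: the penalty is floored once the ratio drops below 0.8.
print(clipped_surrogate(math.log(0.5), 0.0, advantage=-1.0))
```

Because the objective is flat outside the clip range, the gradient for that sample vanishes there, which is what keeps a single update from moving the policy too far.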

The policy (LLM) generates responses, the reward model scores them, and PPO feeds the reward signal back to update the policy.

The Reward Landscape

The 3D visualization below shows an abstract reward landscape. The terrain height represents the reward score — higher ground means the model's output is more aligned with human preferences. Two paths are shown: a green path (high reward) representing a helpful, well-structured response, and a red path (low reward) representing a poor or harmful response. The yellow ring represents the KL divergence boundary — a constraint that prevents the model from drifting too far from its original behavior.

Green path: high-reward response navigating toward the peak. Red path: low-reward response drifting into a valley. Yellow ring: KL penalty boundary.

The model's goal during PPO is to navigate this landscape, finding regions of high reward while staying within the KL boundary. If it strays too far from the reference model (crosses the yellow ring), the KL penalty kicks in, pushing it back toward safer territory. This balance between exploration and constraint is what makes RLHF both powerful and stable.

The KL Penalty: Preventing Reward Hacking

Left unchecked, the policy would learn to exploit the reward model — generating outputs that score highly but are nonsensical or harmful. This is called reward hacking. To prevent it, RLHF adds a KL divergence penalty to the objective function.

objective = E[r(x,y)] − β · KL(π_θ ‖ π_ref)

The KL penalty measures how different the current policy π_θ is from the reference policy π_ref (the SFT model before PPO). The hyperparameter β controls the strength of this constraint. A higher β means the model is more conservative, staying closer to the SFT model. A lower β allows more exploration but risks reward hacking. In practice, β is tuned carefully: it is either held fixed at a small value or adjusted adaptively during training so that the measured KL stays near a target.
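As a rough illustration of the objective above, the KL term can be estimated from a single sampled response as the difference between its log-probability under the policy and under the reference model. The sketch below assumes sequence-level log-probabilities are already computed; the function name and the numbers are made up for the example.

```python
def kl_penalized_reward(reward, logp_policy, logp_ref, beta):
    """Single-sample estimate of r(x, y) - beta * KL(pi_theta || pi_ref).

    logp_policy / logp_ref: log-probability of the sampled response y under
    the current policy and under the frozen reference (SFT) model. For y
    drawn from the policy, their difference is an unbiased estimate of the KL.
    """
    kl_estimate = logp_policy - logp_ref
    return reward - beta * kl_estimate

# The policy has drifted: it finds this response far more likely than the
# reference model does, so part of the reward is taken back.
loose = kl_penalized_reward(reward=2.0, logp_policy=-10.0, logp_ref=-14.0, beta=0.1)
tight = kl_penalized_reward(reward=2.0, logp_policy=-10.0, logp_ref=-14.0, beta=0.5)
print(loose, tight)
```

With the larger β the same response earns no net reward at all, which is the "tighter leash" behavior: drifting from the reference model has to be paid for with a correspondingly higher reward score.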

Think of the KL penalty like a leash. The model is encouraged to explore the reward landscape for better outputs, but the leash keeps it from running too far from home (the reference model). Too tight a leash and the model cannot improve; too loose and it wanders into dangerous territory.

DPO — Direct Preference Optimization

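In 2023, Rafailov et al. proposed DPO (Direct Preference Optimization), a simpler alternative to the RLHF pipeline. The key insight is mathematical: the optimal policy for the KL-regularized RLHF objective has a closed-form expression in terms of the reward function and the reference policy. Inverting that expression writes the reward in terms of the policy itself, so the policy can be trained directly on the preference data, without needing to train a separate reward model or run PPO.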
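The derivation behind this insight can be summarized as follows. This is a condensed sketch in the notation used earlier (π_θ, π_ref, β, and preferred and dispreferred responses y_w and y_l); Z(x) is a normalizing constant and 𝒟 the preference dataset, both symbols introduced here for the sketch.

```latex
% KL-regularized objective and its optimal policy
\max_{\pi}\; \mathbb{E}_{x,\, y \sim \pi}\big[ r(x, y) \big]
  - \beta\, \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)
\quad\Longrightarrow\quad
\pi^{*}(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\,
  \exp\!\Big( \tfrac{1}{\beta}\, r(x, y) \Big)

% Solve for the reward
r(x, y) = \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)

% Substitute into the Bradley-Terry model; Z(x) cancels in the difference
\mathcal{L}_{\mathrm{DPO}}(\theta) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
  \left[ \log \sigma\!\left(
    \beta \log \frac{\pi_{\theta}(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_{\theta}(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right) \right]
```

Because the intractable term Z(x) depends only on the prompt, it drops out when two responses to the same prompt are compared, which is what makes the loss computable.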

DPO reformulates the RLHF objective as a simple classification loss on the preference pairs. Given a prompt x and two responses y_w (preferred) and y_l (dispreferred), DPO trains the policy to increase the likelihood of y_w relative to y_l, using the reference model as a baseline. The entire reward-modeling and PPO pipeline is replaced by a single fine-tuning step. The result: DPO requires only 2 models in memory (policy + reference) versus up to 4 for PPO (policy, reference, reward model, and a value model, which some implementations attach to the policy as a value head). It is simpler to implement, often more stable to train, and increasingly popular in practice.
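The classification loss can be sketched for a single preference pair as below. This is a minimal illustration that takes sequence log-probabilities as plain floats; β = 0.1 is an assumed default, not a value from this article, and the function name is hypothetical.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (prompt, preferred, dispreferred) triple.

    logp_w / logp_l: log-probability of the preferred and dispreferred
    response under the policy being trained.
    ref_logp_w / ref_logp_l: the same quantities under the frozen reference.
    """
    # How much more the policy favors y_w over y_l than the reference does.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# At initialization the policy equals the reference, so the loss is log(2).
start = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# After training shifts probability toward the preferred response, it drops.
later = dpo_loss(-9.0, -18.0, -12.0, -15.0)
print(round(start, 4), later < start)
```

Only the policy's log-probabilities receive gradients; the reference values are constants, which is why two models in memory are enough.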

RLHF vs DPO: Pipeline Comparison

The diagram below compares the two pipelines side by side. RLHF (PPO) goes through three distinct stages, including a separate reward model training phase. DPO collapses the last two stages into a single direct optimization step.

DPO skips the reward model entirely — simpler pipeline, fewer models in memory, increasingly popular for open-source fine-tuning.

Beyond Human Feedback: Constitutional AI & RLAIF

RLHF has a fundamental scaling limitation: it requires human labelers. As models become more capable, labeling their outputs requires increasingly specialized expertise, and the cost grows linearly with the amount of feedback needed. Two approaches address this: Constitutional AI and RLAIF.

Constitutional AI (Anthropic, 2022) introduces the idea of self-critique. Instead of relying solely on human feedback, the model is given a set of principles (a "constitution") and asked to critique and revise its own outputs. For example, the model generates a response, then evaluates whether it is helpful and harmless according to the constitution, and revises it accordingly. This process generates synthetic training data that can replace or augment human labels. RLAIF (RL from AI Feedback), a term introduced in the same line of work, generalizes the idea: use a powerful AI model (like GPT-4) to label preferences instead of humans, then train with the same RLHF pipeline.
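The AI-labeling step can be pictured with a small sketch. Everything here is hypothetical: `label_preference` and `toy_judge` are illustrative names, and the keyword rule stands in for what would really be a call to a strong model prompted with the constitution. The point is only the shape of the data that comes out, which matches the preference pairs a human labeler would otherwise produce.

```python
def label_preference(prompt, response_a, response_b, judge):
    """Build one synthetic preference record from an AI judge's scores.

    judge: any callable (prompt, response) -> float, higher meaning better.
    """
    score_a = judge(prompt, response_a)
    score_b = judge(prompt, response_b)
    if score_a >= score_b:
        chosen, rejected = response_a, response_b
    else:
        chosen, rejected = response_b, response_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

def toy_judge(prompt, response):
    """Stand-in judge (assumption): penalizes overconfident claims.

    A real judge would query a language model with the principles it
    should apply and parse its verdict.
    """
    return 0.0 if "guaranteed" in response.lower() else 1.0

record = label_preference(
    "Is this investment safe?",
    "Returns are guaranteed, buy now.",
    "No investment is risk-free; here are the main risks to consider.",
    toy_judge,
)
print(record["chosen"])
```

Records in this form can feed reward model training or a DPO-style loss in place of human comparisons.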

The Future of Alignment

As models become more capable of evaluating their own outputs, the field is moving toward scalable oversight — methods that can align superhuman AI systems without requiring superhuman labelers. Constitutional AI and RLAIF are early steps in this direction, but the challenge of aligning AI systems that are smarter than their overseers remains one of the most important open problems in AI safety.

Key Takeaways

RLHF is a three-stage pipeline (SFT → Reward Model → PPO) that aligns language models with human preferences, transforming raw text predictors into helpful assistants.

The reward model encodes human preferences into a learned scalar scoring function, enabling automated optimization at scale without needing humans to evaluate every output.

The KL divergence penalty prevents reward hacking by keeping the optimized model close to the reference (SFT) model — it acts as a stabilizing constraint during PPO.

DPO simplifies RLHF by eliminating the explicit reward model and PPO entirely, replacing them with a direct preference optimization loss. It is an increasingly common choice for open-source fine-tuning.

Constitutional AI and RLAIF point toward a future where AI systems help evaluate and align themselves, addressing the scalability limitations of human feedback.
