What is RLHF? Reinforcement Learning from Human Feedback
RLHF (Reinforcement Learning from Human Feedback) is a technique for turning a raw language model into a helpful, aligned AI assistant. It's a key ingredient in making assistants such as ChatGPT, Claude, and Gemini useful in practice.
What is RLHF?
RLHF stands for Reinforcement Learning from Human Feedback. It's a training technique that uses human preferences to fine-tune a language model so its outputs align with what humans actually want — helpful, harmless, and honest responses.
Without RLHF, a language model trained purely on next-token prediction would be good at predicting statistically likely text, but not necessarily at being a good assistant. It might produce rambling, unhelpful, or even harmful outputs because its training objective (predict the next word) doesn't directly encode "be helpful."
RLHF solves this by introducing a human preference signal into the training loop. Humans rank model outputs from best to worst, a reward model learns these preferences, and then the language model is optimized to produce outputs that score high on the reward model.
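Learning a reward signal from human preference comparisons predates language-model assistants (Christiano et al., 2017). OpenAI's InstructGPT paper (2022) popularized the recipe for instruction-following language models, and it became a standard approach behind ChatGPT, Claude, and many modern AI assistants.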
Why Alignment Matters
The alignment problem is one of the most important challenges in AI: how do we ensure AI systems do what we actually want them to do?
A base language model has a simple objective: predict the next token. This creates several problems:
- Not instruction-following: Given "Write a poem about cats," a base model might continue with more instructions instead of actually writing a poem
- Harmful content: The model can generate whatever is statistically likely, including toxic, biased, or dangerous content
- Verbosity: Outputs can ramble, because nothing in the next-token objective rewards a concise, on-point answer
- Hallucination: Models can confidently state false information because the training objective doesn't distinguish truth from falsehood
- No safety boundaries: The model has no concept of what it should refuse to do
RLHF targets these issues by directly optimizing for human preferences, although it reduces them rather than eliminating them. Instead of asking "what text is most likely?", we ask "what text would a human prefer?" These are fundamentally different questions, and the difference is what makes aligned AI assistants possible.
The gap between "predicting likely text" and "being a helpful assistant" is exactly what RLHF bridges. It's the difference between a model that can write text and a model that can have a useful conversation.
The RLHF Pipeline
The full RLHF pipeline consists of three main stages. Let's walk through each one.
Stage 1: Supervised Fine-Tuning (SFT)
Before any preference-based training begins, the base model is fine-tuned on high-quality instruction-response pairs. Human annotators write ideal responses to various prompts, and the model is trained to imitate these responses. This gives the model a basic ability to follow instructions.
After SFT, the model can follow instructions but its outputs are only as good as the training data. It doesn't yet have a way to learn from ongoing feedback.
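To make the SFT objective concrete, here is a minimal sketch of the loss for one training example. It assumes the model's log-probability for each reference token has already been computed, and it assumes the loss is averaged over response tokens only (masking the prompt is a common choice, not something the text above specifies). The function name and the numbers are illustrative.

```python
import math

def sft_loss(token_logprobs, is_response_token):
    """Average negative log-likelihood over response tokens only.

    token_logprobs: log-probability the model assigns to each reference token.
    is_response_token: True for tokens of the human-written response,
    False for prompt tokens (assumption: prompt tokens are not trained on).
    """
    picked = [-lp for lp, keep in zip(token_logprobs, is_response_token) if keep]
    if not picked:
        raise ValueError("no response tokens to train on")
    return sum(picked) / len(picked)

# Hypothetical per-token probabilities for a 4-token sequence.
logprobs = [math.log(0.9), math.log(0.8), math.log(0.5), math.log(0.25)]
mask = [False, False, True, True]  # first two tokens belong to the prompt

loss = sft_loss(logprobs, mask)
print(f"SFT loss: {loss:.4f}")
```

Lower loss means the model assigns higher probability to the annotator's response, which is all that imitation learning asks for at this stage.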
Stage 2: Reward Model Training
Human annotators are shown multiple model outputs for the same prompt and asked to rank them from best to worst. These rankings are used to train a reward model — a neural network that predicts how much a human would like any given output.
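A ranking of several outputs is usually converted into pairwise comparisons before training. The sketch below shows one way to do that, assuming the ranking is a list ordered from best to worst with no ties; the function name and tuple layout are illustrative.

```python
from itertools import combinations

def ranking_to_pairs(prompt, ranked_responses):
    """Expand a best-to-worst ranking into (prompt, chosen, rejected) tuples.

    combinations() preserves input order, so the first element of each
    pair is always the higher-ranked response.
    """
    return [
        (prompt, better, worse)
        for better, worse in combinations(ranked_responses, 2)
    ]

pairs = ranking_to_pairs("Write a poem about cats", ["resp_a", "resp_b", "resp_c"])
for _, chosen, rejected in pairs:
    print(f"{chosen} preferred over {rejected}")
```

A ranking of K responses yields K·(K−1)/2 pairs, so a single annotation task can produce several training comparisons.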
Stage 3: Reinforcement Learning (PPO)
The SFT model is further optimized using reinforcement learning. The model generates outputs, the reward model scores them, and the model is updated to produce higher-scoring outputs. This is typically done using the PPO (Proximal Policy Optimization) algorithm.
The complete pipeline looks like this:
- Pre-train base model on large text corpus
- Fine-tune on human-written instruction-response pairs (SFT)
- Collect human preference data (ranked outputs)
- Train reward model on preference data
- Optimize SFT model using PPO + reward model
- Evaluate and iterate
Building the Reward Model
The reward model is the heart of RLHF. It learns to predict human preferences so we can use it as a scalable proxy for human judgment during training.
How It Works
Given a prompt and a response, the reward model outputs a scalar score indicating how good the response is. The model is trained on pairwise comparisons: given two responses to the same prompt, which one is better?
The training data looks like this:
- Prompt: "Explain quantum computing to a 10-year-old"
- Response A: "Quantum computing uses qubits that can be 0 and 1 simultaneously..."
- Response B: "Imagine you have a magic coin that can be both heads and tails at the same time..."
- Human judgment: B is better (simpler, more age-appropriate)
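One possible in-memory representation of such a record is sketched below. The class and field names are assumptions made for illustration; real preference datasets use a variety of schemas. The response strings are kept truncated exactly as in the example above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    chosen: str    # response the annotator preferred
    rejected: str  # response the annotator ranked lower

example = PreferencePair(
    prompt="Explain quantum computing to a 10-year-old",
    chosen="Imagine you have a magic coin that can be both heads and tails at the same time...",
    rejected="Quantum computing uses qubits that can be 0 and 1 simultaneously...",
)
print(example.prompt)
```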
The reward model learns from many thousands of these comparisons. It is trained with a pairwise ranking loss that pushes the score of the preferred response above the score of the rejected one.
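A common form of this ranking loss is the Bradley-Terry style objective, -log(sigmoid(r_chosen - r_rejected)). The sketch below computes it for a single pair of scalar scores. It assumes this particular loss (the text above does not name one), and the scores are made-up numbers.

```python
import math

def pairwise_ranking_loss(score_chosen, score_rejected):
    """-log(sigmoid(score_chosen - score_rejected)), computed stably.

    The loss is small when the chosen response outscores the rejected one
    by a wide margin, and large when the order is reversed.
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

print(pairwise_ranking_loss(2.0, 0.0))  # correct ordering: lower loss
print(pairwise_ranking_loss(0.0, 2.0))  # wrong ordering: higher loss
```

Only the difference between the two scores matters, which is why reward model outputs are meaningful relative to each other rather than on an absolute scale.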
Reward Model Architecture
Typically, the reward model is initialized from the SFT model itself. The language modeling head is replaced with a scalar output head that produces a single number (the reward score). This leverages the model's existing language understanding while adding the ability to evaluate quality.
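Conceptually, the scalar head is just a linear projection from a hidden vector to one number. The sketch below assumes the reward is read from the hidden state of the final token, which is a common choice but an assumption here; the weights and hidden values are made up, and a real implementation would use a deep learning framework rather than plain lists.

```python
def scalar_reward_head(final_hidden_state, weights, bias=0.0):
    """Project a hidden vector to a single reward score (dot product + bias)."""
    if len(final_hidden_state) != len(weights):
        raise ValueError("hidden state and weight vector must have the same size")
    return sum(h * w for h, w in zip(final_hidden_state, weights)) + bias

# Hypothetical 4-dimensional hidden state for the last token of a response.
hidden = [0.5, -1.0, 2.0, 0.0]
head_weights = [0.2, 0.1, 0.3, -0.4]
score = scalar_reward_head(hidden, head_weights, bias=0.05)
print(f"reward score: {score:.3f}")
```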
Data Collection Challenges
Collecting high-quality preference data is expensive and time-consuming:
- Annotators need training and guidelines
- Inter-annotator agreement is often low (people disagree on quality)
- Some outputs are hard to compare (both good, both bad, different strengths)
- Cultural and personal biases affect rankings
Labs such as OpenAI and Anthropic rely on trained annotators and detailed written guidelines to improve consistency.
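Disagreement between annotators can be quantified in several ways. A minimal sketch of the simplest one, the raw pairwise agreement rate, is shown below. The annotator names and labels are invented for illustration, and chance-corrected statistics such as Cohen's kappa are often preferred in practice.

```python
from itertools import combinations

def pairwise_agreement(labels_by_annotator):
    """Fraction of (annotator pair, comparison) cases where both picked the same response."""
    agree = 0
    total = 0
    for labels_a, labels_b in combinations(labels_by_annotator.values(), 2):
        for x, y in zip(labels_a, labels_b):
            agree += int(x == y)
            total += 1
    if total == 0:
        raise ValueError("need at least two annotators and one comparison")
    return agree / total

# Hypothetical labels: which response ("A" or "B") each annotator preferred
# on the same four comparisons.
labels = {
    "annotator_1": ["A", "B", "A", "A"],
    "annotator_2": ["A", "B", "B", "A"],
    "annotator_3": ["B", "B", "A", "A"],
}
rate = pairwise_agreement(labels)
print(f"agreement rate: {rate:.2f}")
```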
PPO: Optimizing the Policy
Once we have a reward model, we can use reinforcement learning to optimize the language model. The language model is treated as a policy in RL terminology — it takes an action (generating a token) given a state (the prompt and tokens generated so far).
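The state/action framing can be sketched with a toy policy. Everything here is hypothetical: the three-token vocabulary, the hand-written probabilities, and the function names exist only to show that each generated token is an action and that the state grows by one token per step.

```python
import random

def toy_policy(state):
    """Hypothetical policy: maps a state (tuple of tokens) to next-token probabilities."""
    if state and state[-1] == "cats":
        return {"purr": 0.7, "<eos>": 0.3}
    return {"cats": 0.6, "purr": 0.1, "<eos>": 0.3}

def rollout(policy, prompt_tokens, max_new_tokens=5, seed=0):
    """Sample one response token by token, returning the list of actions taken."""
    rng = random.Random(seed)
    state = tuple(prompt_tokens)      # state = prompt + tokens generated so far
    actions = []
    for _ in range(max_new_tokens):
        probs = policy(state)
        tokens, weights = zip(*probs.items())
        action = rng.choices(tokens, weights=weights, k=1)[0]  # action = next token
        actions.append(action)
        if action == "<eos>":
            break
        state = state + (action,)     # the action becomes part of the next state
    return actions

response = rollout(toy_policy, ["write", "about"])
print(response)
```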
Why PPO?
Proximal Policy Optimization (PPO) is the algorithm most commonly used in RLHF because it is relatively stable and sample-efficient. Simpler methods such as vanilla policy gradient can take overly large update steps when applied to language models, which can abruptly degrade output quality or erase capabilities learned earlier.
PPO works by constraining how much the policy can change in each update step. This prevents the model from making drastic changes that might improve the reward score but degrade output quality.
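The constraint is implemented through PPO's clipped surrogate objective. The sketch below evaluates it for a single sampled action; the clipping range of 0.2 is an illustrative value, not one taken from the text above.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO clipped objective for one action (to be maximized).

    ratio = pi_new(a|s) / pi_old(a|s). Taking the minimum of the clipped
    and unclipped terms removes any incentive to push the ratio outside
    [1 - clip_eps, 1 + clip_eps] in the direction the advantage favors.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# The new policy makes a good action 1.5x more likely: the gain is capped.
print(clipped_surrogate(math.log(1.5), 0.0, advantage=1.0))
# The new policy makes a bad action half as likely: the gain is also capped.
print(clipped_surrogate(math.log(0.5), 0.0, advantage=-1.0))
```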
The PPO Training Loop
- Generate: The model generates a response to a prompt
- Score: The reward model assigns a score to the response
- KL penalty: Subtract a penalty for diverging from the SFT model, which discourages reward hacking
- Compute advantage: Estimate how much better or worse the penalized reward is than expected
- Update policy: Adjust the model toward higher-advantage outputs, with a clipping constraint to prevent large changes
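The loop above can be sketched as a heavily simplified toy. It assumes a single-step "bandit" setting where the policy picks one of a few canned responses, a fixed list of rewards standing in for the reward model, a uniform reference policy standing in for the SFT model, and a running-mean baseline in place of a learned value function. None of these simplifications reflect how production RLHF systems are built; the point is only to show the order of the steps.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ppo_bandit(rewards, steps=200, lr=0.5, clip_eps=0.2, beta=0.1, epochs=4, seed=0):
    """Toy PPO over len(rewards) canned responses; returns final probabilities."""
    rng = random.Random(seed)
    n = len(rewards)
    ref_probs = softmax([0.0] * n)  # frozen reference policy (uniform)
    logits = [0.0] * n              # trainable policy, starts at the reference
    baseline = 0.0                  # running mean of penalized rewards
    for _ in range(steps):
        old_probs = softmax(logits)
        # Generate: sample one response from the current policy.
        a = rng.choices(range(n), weights=old_probs, k=1)[0]
        # Score + KL penalty: reward minus beta * log(pi / pi_ref).
        shaped = rewards[a] - beta * (math.log(old_probs[a]) - math.log(ref_probs[a]))
        # Compute advantage relative to the running baseline.
        advantage = shaped - baseline
        baseline += 0.1 * (shaped - baseline)
        # Update policy: a few gradient-ascent steps on the clipped objective.
        for _ in range(epochs):
            probs = softmax(logits)
            ratio = probs[a] / old_probs[a]
            clipped = (advantage > 0 and ratio > 1 + clip_eps) or (
                advantage < 0 and ratio < 1 - clip_eps
            )
            if clipped:
                break  # the clipped objective has zero gradient here
            for i in range(n):
                grad_logp = (1.0 if i == a else 0.0) - probs[i]
                logits[i] += lr * advantage * ratio * grad_logp
    return softmax(logits)

final_probs = ppo_bandit([0.0, 0.5, 1.0])
print([round(p, 3) for p in final_probs])
```

In a real system each "response" is a full token sequence, the advantage is usually estimated per token with a value network, and updates are batched across many prompts.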
The KL Penalty
A crucial component of the RLHF objective is the KL divergence penalty. It measures how far the current model's output distribution has drifted from that of the original SFT model. Without this penalty, the model could exploit weaknesses in the reward model, producing outputs that score high but are nonsensical or repetitive.
The KL penalty ensures the model stays close to sensible language while still optimizing for the reward signal.
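As a small illustration, the sketch below computes an exact KL divergence between two toy distributions and then applies the penalty to a reward score using the single-sample estimate log pi(y|x) - log pi_ref(y|x). The coefficient beta = 0.1 and all the numbers are assumptions chosen for the example.

```python
import math

def kl_divergence(p, q):
    """Exact KL(p || q) for discrete distributions (assumes q > 0 wherever p > 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kl_penalized_reward(rm_score, logp_policy, logp_reference, beta=0.1):
    """Reward-model score minus beta times a single-sample KL estimate.

    logp_policy and logp_reference are the summed token log-probabilities
    of the same sampled response under the current and reference models.
    """
    return rm_score - beta * (logp_policy - logp_reference)

policy_dist = [0.7, 0.2, 0.1]
reference_dist = [0.4, 0.4, 0.2]
print(f"KL(policy || reference) = {kl_divergence(policy_dist, reference_dist):.4f}")

# A response the policy now finds much more likely than the reference did
# has part of its reward taken away.
print(kl_penalized_reward(rm_score=2.0, logp_policy=-10.0, logp_reference=-12.0))
```

A larger beta keeps the model closer to the SFT model; a smaller beta lets it chase the reward model's score more aggressively.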
DPO: The Simpler Alternative
In 2023, researchers at Stanford introduced Direct Preference Optimization (DPO), a simpler alternative to PPO-based RLHF that eliminates the need for a separate reward model and an RL training loop.
How DPO Works
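DPO's key insight is that, under the KL-constrained RLHF objective, the optimal policy can be written in closed form in terms of the reward. The preference loss can therefore be expressed directly in terms of the policy, with no need to first train an explicit reward model. Instead of the two-stage process (train reward model → optimize with PPO), DPO trains the policy on preference pairs in a single stage.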
The DPO loss function directly increases the likelihood of preferred responses and decreases the likelihood of rejected responses, with a built-in KL constraint that prevents the model from diverging too far from the reference policy.
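A minimal sketch of the DPO loss for a single preference pair follows. It assumes the summed log-probabilities of the chosen and rejected responses under both the trainable policy and the frozen reference model are already available; beta = 0.1 and the numeric inputs are illustrative.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log(sigmoid(beta * (chosen log-ratio - rejected log-ratio))).

    Each log-ratio compares the policy to the reference model, which is
    where the implicit KL constraint comes from.
    """
    chosen_logratio = logp_chosen - ref_logp_chosen
    rejected_logratio = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # Stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Policy identical to the reference: the loss starts at log(2).
print(dpo_loss(-20.0, -25.0, -20.0, -25.0))
# Policy has shifted probability toward the chosen response: lower loss.
print(dpo_loss(-15.0, -30.0, -20.0, -25.0))
```

Because the loss depends only on log-probabilities of fixed responses, training looks like ordinary supervised learning: no sampling from the model and no reward model forward pass are needed.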
RLHF vs DPO
| Aspect | RLHF (PPO) | DPO |
|---|---|---|
| Complexity | High — requires reward model + RL loop | Low — single-stage training |
| Stability | Can be unstable, needs careful tuning | More stable, easier to implement |
| Scalability | Scales well with compute | Scales well, less compute needed |
| Online learning | Can sample fresh outputs during training | Typically offline (fixed dataset); online variants exist |
| Performance | Sometimes reported as better on complex tasks | Often comparable |
| Adoption | OpenAI (InstructGPT), Anthropic (early assistants) | Meta (Llama 3), many open-source models |
In practice, many organizations now use DPO or related preference-optimization methods (such as ORPO and KTO) because they're simpler and often achieve comparable results. However, PPO-based RLHF remains important for frontier models where every bit of performance matters.
Limitations of RLHF
Despite its importance, RLHF has significant limitations that researchers are actively working to address.
Reward Hacking
The model can learn to exploit weaknesses in the reward model rather than genuinely improving. For example, it might learn that longer responses score higher and produce unnecessarily verbose outputs, or use specific phrases that trigger high reward scores without actually being better.
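One rough diagnostic for the length example is to check whether reward scores correlate with response length on a sample of outputs. The sketch below is an assumption-laden illustration, not a method described above: the lengths and scores are invented, and a high correlation only hints at length bias, since longer answers can also be genuinely better.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical sample: response lengths in tokens and reward-model scores.
lengths = [40, 85, 120, 200, 310, 450]
scores = [0.2, 0.5, 0.4, 0.9, 1.1, 1.6]
r = pearson(lengths, scores)
print(f"length/reward correlation: {r:.2f}")
```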
Annotation Quality
The quality of RLHF is limited by the quality of human annotations. Annotators may have biases, make mistakes, or disagree with each other. Complex tasks (like evaluating code correctness or mathematical reasoning) are especially hard for human raters.
Over-Optimization
Models can become over-aligned: overly cautious, refusing to answer legitimate questions, or giving hedged non-answers. This is one form of what is sometimes called the "alignment tax," where alignment training costs the model some usefulness or capability.
Scalability
RLHF requires ongoing human annotation, which is expensive and doesn't scale well. As models improve, the tasks that need human evaluation become harder and more specialized.
Emerging Alternatives
Researchers are exploring several alternatives to traditional RLHF:
- RLAIF (RL from AI Feedback): Using AI models instead of humans to generate preference data
- Constitutional AI: Training models to self-critique based on a set of principles
- Process Reward Models: Rewarding correct reasoning steps, not just final answers
- Self-Play: Models improving by competing against themselves
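To illustrate the difference between outcome and process rewards from the list above, here is a small sketch. The per-step scores are invented, and aggregating them with min() is just one possible choice (products and averages are also used); nothing here reflects a specific published system.

```python
def outcome_reward(final_answer_correct):
    """Outcome-based reward: looks only at whether the final answer is right."""
    return 1.0 if final_answer_correct else 0.0

def process_reward(step_scores):
    """Process-based reward: aggregate per-step scores with min(),
    so a single flawed step caps the score of the whole solution."""
    if not step_scores:
        raise ValueError("need at least one reasoning step")
    return min(step_scores)

# Hypothetical solution that reaches the right answer through a flawed step.
step_scores = [0.9, 0.2, 0.95]
print("outcome reward:", outcome_reward(True))
print("process reward:", process_reward(step_scores))
```

The outcome reward cannot tell this solution apart from a fully correct one, while the process reward flags the weak middle step.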
Frequently Asked Questions
What does RLHF stand for?
RLHF stands for Reinforcement Learning from Human Feedback. It's a technique for training AI models to produce outputs that align with human preferences by using human rankings to build a reward model, then optimizing the LLM against that reward.
Why is RLHF important for LLMs?
Without RLHF, LLMs trained purely on next-token prediction tend to produce outputs that are statistically likely but not necessarily helpful, truthful, or safe. RLHF bridges the gap between "predicting text" and "being a useful assistant" by directly optimizing for human-preferred behavior.
What is the difference between RLHF and DPO?
RLHF uses a separate reward model and reinforcement learning (PPO) to optimize the LLM. DPO (Direct Preference Optimization) skips the explicit reward model and optimizes the LLM directly on preference pairs. DPO is simpler to implement and generally more stable, but PPO-based RLHF can be more powerful for complex alignment tasks.
What are the main limitations of RLHF?
RLHF has several limitations: it requires expensive human annotation, the reward model can be gamed (reward hacking), human raters may have biases, and it can make models overly cautious. It also struggles with tasks where humans can't easily evaluate quality, such as complex reasoning or code correctness.