What is RAG? Retrieval-Augmented Generation Explained
Retrieval-Augmented Generation (RAG) is a technique that enhances LLM responses by retrieving relevant information from external knowledge bases before generating answers. It's one of the most practical approaches to making LLMs accurate, up-to-date, and grounded in facts.
What is RAG?
Retrieval-Augmented Generation (RAG) is a technique that combines the power of large language models with external knowledge retrieval. Instead of relying solely on what the LLM memorized during training, RAG first retrieves relevant documents or passages from a knowledge base, then provides them as context for the LLM to generate a response.
The concept was introduced by Facebook AI Research (now Meta AI) in a 2020 paper by Patrick Lewis et al. The key insight was simple but powerful: don't make the model memorize everything — let it look things up.
Think of it this way: an LLM without RAG is like a brilliant student taking a closed-book exam. They might get many answers right from memory, but they'll also make mistakes and "hallucinate" facts. An LLM with RAG is like the same student taking an open-book exam — they can look up the relevant material and provide accurate, well-sourced answers.
The Problem RAG Solves
Standard LLMs have several fundamental limitations:
- Knowledge cutoff: They only know information from their training data. Ask about events after the cutoff, and they'll guess or hallucinate.
- Hallucinations: They can generate confident but incorrect information, especially about specific facts, numbers, or niche topics.
- No source attribution: They can't tell you where their information came from, making it hard to verify claims.
- Private data ignorance: They know nothing about your company's internal documents, products, or policies.
RAG addresses all of these by providing the model with relevant, up-to-date, and sourced information at query time.
Why RAG Matters
RAG has become one of the most widely used patterns for building production LLM applications. Here's why:
1. Accuracy and Trust
By grounding responses in retrieved documents, RAG substantially reduces hallucinations. The model's response draws on actual documents rather than on training memory alone. When the retrieved context doesn't contain the answer, a well-implemented RAG system can be prompted to say "I don't know" rather than make something up.
2. Up-to-Date Information
RAG knowledge bases can be updated in real-time. When your company releases a new product, you add the documentation to the knowledge base — no retraining needed. The LLM immediately has access to the latest information.
3. Source Attribution
RAG can provide citations for its claims. Instead of "The answer is X," it can say "According to [document], the answer is X." This is critical for enterprise applications where users need to verify information.
4. Private Data Access
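RAG lets LLMs work with your private data without using it to train or fine-tune a model. Your documents stay in your own knowledge base, and the LLM sees only the retrieved passages as context during inference. Keep in mind that with a hosted LLM API those passages still leave your infrastructure with each request, so strict data-residency requirements call for a self-hosted model.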
5. Cost Efficiency
Compared to fine-tuning, RAG is much cheaper to set up and maintain. You don't need to retrain the model — just update your knowledge base. This makes it accessible to teams of any size.
RAG is the bridge between the general knowledge of LLMs and the specific, current, private information your application needs.
How RAG Works: Retrieve → Augment → Generate
A RAG pipeline has three main stages:
Stage 1: Indexing (Offline)
Before any queries happen, you prepare your knowledge base:
- Chunk: Split documents into smaller passages (typically 200-1000 tokens each). Smaller chunks are more precise but may lose context; larger chunks preserve context but are less targeted.
- Embed: Convert each chunk into a numerical vector using an embedding model (e.g., OpenAI's text-embedding-3-small, or open-source models like BGE or E5).
- Store: Save the vectors and their corresponding text in a vector database (Pinecone, Weaviate, Chroma, Qdrant, etc.).
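To make the chunking step concrete, here is a minimal sketch of a fixed-size chunker. It makes two assumptions that are not part of the text above: it counts whitespace-separated words as a rough proxy for tokens, and it overlaps neighbouring chunks (a common way to soften the loss of context at chunk boundaries). The function name `chunk_words` and its parameters are illustrative only.

```python
def chunk_words(text, chunk_size=200, overlap=40):
    """Split text into overlapping chunks of at most chunk_size words.

    Words stand in for tokens here; a real pipeline would usually count
    tokens with the tokenizer of its embedding model.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
small_chunks = chunk_words(sample, chunk_size=4, overlap=1)
print(small_chunks)
```

Each chunk repeats the last word of the previous one, so a sentence that straddles a boundary is less likely to be cut off from its context in both chunks.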
Stage 2: Retrieval (Online — when a user asks a question)
- Embed the query: Convert the user's question into a vector using the same embedding model.
- Search: Find the most similar document chunks using vector similarity search (typically cosine similarity or dot product).
- Rerank (optional): Use a reranking model to reorder results by relevance, improving precision.
Stage 3: Generation (Online)
- Augment the prompt: Combine the user's question with the retrieved context into a prompt:
"Based on the following context, answer the question. If the context doesn't contain the answer, say so.
Context: [retrieved chunks]
Question: [user's question]"
- Generate: Send the augmented prompt to the LLM, which generates a response grounded in the retrieved context.
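The augmentation step is plain string assembly. A minimal sketch of it might look like the following; the helper name `build_prompt` and the `[1]`, `[2]` source numbering are assumptions added for illustration (numbering the chunks is one simple way to let the model cite them).

```python
def build_prompt(question, chunks):
    """Combine the user's question and retrieved chunks into one prompt."""
    # Number each chunk so the answer can refer back to its source.
    numbered = "\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Based on the following context, answer the question. "
        "If the context doesn't contain the answer, say so.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}"
    )


prompt = build_prompt(
    "What is RAG?",
    ["RAG retrieves relevant documents to augment LLM responses.",
     "GPT is a decoder-only model trained to predict the next token."],
)
print(prompt)
```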
This three-stage pipeline — index, retrieve, generate — is the foundation of every RAG system. The quality of each stage directly impacts the final answer quality.
Vector Databases in RAG
Vector databases are purpose-built for storing and searching high-dimensional vectors. They're a critical component of RAG systems.
Why Vector Databases?
Traditional databases are built for exact matching (SQL WHERE clauses), but semantic search needs to find similar items, not identical ones. "How do I reset my password?" should match "Steps to change your account password" even though the two share only one word, "password". Vector databases make this possible by operating in embedding space, where semantic similarity is measured by distance.
How Vector Search Works
- All documents are converted to vectors (e.g., 1536 dimensions for OpenAI's text-embedding-3-small)
- Vectors are indexed using algorithms like HNSW (Hierarchical Navigable Small World) or IVF (Inverted File Index) for fast approximate search
- When a query vector arrives, the database finds the k nearest vectors using cosine similarity or dot product
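The two similarity measures mentioned above are closely related, and a small brute-force sketch shows how. It uses toy 2-dimensional vectors (an assumption for readability) and an exact scan over all vectors; indexes such as HNSW or IVF approximate this scan so it stays fast at scale. All function names here are illustrative.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def top_k(query, vectors, k, score):
    """Exact k-nearest search: score every vector, return the best k indices."""
    order = sorted(range(len(vectors)),
                   key=lambda i: score(query, vectors[i]),
                   reverse=True)
    return order[:k]


vectors = [[3.0, 0.0], [1.0, 1.0], [0.0, 0.5], [10.0, 9.0]]
query = [1.0, 0.2]

by_cosine = top_k(query, vectors, 2, cosine)
by_raw_dot = top_k(query, vectors, 2, dot)  # long vectors get a boost
unit_vectors = [normalize(v) for v in vectors]
by_unit_dot = top_k(normalize(query), unit_vectors, 2, dot)

print(by_cosine, by_raw_dot, by_unit_dot)  # [0, 3] [3, 0] [0, 3]
```

On raw vectors the dot product favours the long vector at index 3, while cosine similarity looks only at direction. Once every vector is scaled to unit length, the dot product gives the same ranking as cosine similarity, which is why many systems normalize embeddings once at indexing time and then use the cheaper dot product.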
Popular Vector Databases
| Database | Type | Best For |
|---|---|---|
| Pinecone | Managed cloud | Production, easy setup |
| Weaviate | Open-source / cloud | Hybrid search, multimodal |
| Chroma | Open-source | Prototyping, local development |
| Qdrant | Open-source / cloud | High performance, filtering |
| Milvus | Open-source | Large scale, enterprise |
| pgvector | PostgreSQL extension | Existing PostgreSQL users |
For more on vector databases, see our Vector Database Explained guide.
RAG vs Fine-Tuning
RAG and fine-tuning are two approaches to customizing LLM behavior. They solve different problems and are often complementary.
When to Use RAG
- Frequently changing information: Product docs, policies, news
- Source attribution needed: Users need to verify claims
- Large knowledge base: Millions of documents, far more than fits in a context window
- Quick setup: No ML expertise required, just document indexing
- Cost-sensitive: No GPU training costs
When to Use Fine-Tuning
- Changing model behavior: Tone, style, format, reasoning patterns
- Domain-specific language: Medical, legal, or technical jargon
- Consistent output format: JSON, specific templates, structured data
- Task-specific performance: Classification, extraction, specialized tasks
Using Both Together
The most effective production systems often combine both:
- Fine-tune the model to understand your domain's language and output format
- RAG provides the specific, up-to-date facts and data
For example, a medical AI might be fine-tuned to understand medical terminology and respond in clinical language (fine-tuning), while retrieving the latest research papers and drug information (RAG).
RAG gives models knowledge. Fine-tuning gives models skills. The best systems use both.
Advanced RAG Techniques
Basic RAG (embed → search → generate) works, but production systems use several advanced techniques to improve quality:
Query Transformation
Before retrieval, transform the user's query to improve search results:
- Query rewriting: Use an LLM to rephrase the query for better retrieval
- HyDE (Hypothetical Document Embedding): Generate a hypothetical answer, embed it, and use it for retrieval (often retrieves better than the raw question)
- Query decomposition: Break complex questions into sub-questions, retrieve for each
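Of the three transformations, HyDE is the least obvious, so here is a sketch of its control flow. The callables `generate`, `embed`, and `search` are hypothetical stand-ins for an LLM call, an embedding model, and a vector store lookup; the stubs below exist only to make the example runnable and do not reflect any real library's API.

```python
def hyde_retrieve(question, generate, embed, search, top_k=3):
    """Retrieve with the embedding of a hypothetical answer, not the question."""
    # 1. Ask the LLM to write a passage that would answer the question.
    hypothetical = generate(f"Write a short passage that answers: {question}")
    # 2. Embed that passage; it tends to look more like a real document
    #    than the question does.
    vector = embed(hypothetical)
    # 3. Search the vector store with the passage's embedding.
    return search(vector, top_k)


# --- Stub components, for illustration only ---
embedded_texts = []

def fake_generate(prompt):
    return "Passwords can be reset from the account settings page."

def fake_embed(text):
    embedded_texts.append(text)
    return [float(len(text))]

def fake_search(vector, top_k):
    return ["doc_1", "doc_2", "doc_3"][:top_k]


hits = hyde_retrieve("How do I reset my password?",
                     fake_generate, fake_embed, fake_search, top_k=2)
print(hits)
```

The point to notice is that the text passed to `embed` is the generated passage. The hypothetical answer may contain wrong facts, but it is only used to locate real documents and never shown to the user.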
Hybrid Search
Combine vector search (semantic) with keyword search (BM25) for better recall. Vector search finds relevant documents that use different terminology than the query, while keyword search catches exact terms that embeddings can blur, such as product names, error codes, and identifiers.
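The two result lists have to be merged somehow, and their scores are not directly comparable. One widely used option, shown here as an assumption since the text above doesn't prescribe a method, is Reciprocal Rank Fusion (RRF), which only looks at each document's rank in each list. The document IDs are made up, and `k=60` is the constant conventionally used with RRF.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one fused ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    so documents ranked well by several retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


vector_hits = ["doc_a", "doc_b", "doc_c"]   # from semantic search
keyword_hits = ["doc_c", "doc_a", "doc_d"]  # from BM25
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Documents found by both retrievers (`doc_a`, `doc_c`) outrank those found by only one, without any need to put BM25 scores and cosine similarities on the same scale.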
Reranking
After initial retrieval, use a cross-encoder model (like Cohere Rerank or BGE Reranker) to score each document against the query. Rerankers are more accurate than embedding similarity but too slow to run on the full database — so you retrieve top-50 with vectors, then rerank to top-5.
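The retrieve-then-rerank pattern can be sketched independently of any particular model. In the example below both scorers are toy stand-ins and are assumptions for illustration: `word_overlap` plays the role of fast vector similarity, and `jaccard` plays the role of the slower, more accurate cross-encoder.

```python
def retrieve_then_rerank(query, docs, cheap_score, expensive_score,
                         n_candidates=50, n_final=5):
    """Two-stage retrieval: cheap scoring on everything, costly scoring on few."""
    # Stage 1: the cheap scorer narrows the collection to a candidate set.
    candidates = sorted(docs, key=lambda d: cheap_score(query, d),
                        reverse=True)[:n_candidates]
    # Stage 2: the expensive scorer runs only on those candidates.
    return sorted(candidates, key=lambda d: expensive_score(query, d),
                  reverse=True)[:n_final]


def word_overlap(query, doc):
    """Toy stand-in for vector similarity: number of shared words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


expensive_calls = []  # records how often the costly scorer runs

def jaccard(query, doc):
    """Toy stand-in for a cross-encoder: Jaccard similarity of word sets."""
    expensive_calls.append(doc)
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d)


docs = [
    "how to reset a forgotten password",
    "reset your password in settings",
    "billing and invoices",
    "password policy for admins",
]
best = retrieve_then_rerank("reset password", docs, word_overlap, jaccard,
                            n_candidates=2, n_final=1)
print(best, len(expensive_calls))
```

The first two documents tie under the cheap scorer, and the second stage breaks the tie in favour of the shorter, more focused one. The expensive scorer runs twice instead of four times, which is the saving that matters when the collection holds millions of chunks.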
Contextual Compression
Instead of passing full retrieved chunks to the LLM, extract only the relevant sentences. This reduces noise and fits more relevant information into the context window.
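As a rough illustration of the idea, the sketch below keeps only the sentences of a chunk that share at least one word with the query. This keyword filter is a deliberately crude assumption: production systems more often use an LLM or an embedding-based scorer to pick sentences, and common words like "the" would cause false matches here. The name `compress_chunk` is illustrative.

```python
import re


def compress_chunk(query, chunk, min_overlap=1):
    """Keep only the sentences that share words with the query."""
    query_terms = set(re.findall(r"\w+", query.lower()))
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", chunk.strip())
    kept = [
        s for s in sentences
        if len(query_terms & set(re.findall(r"\w+", s.lower()))) >= min_overlap
    ]
    return " ".join(kept)


chunk = ("Our office opened in 2010. "
         "Refunds are issued within 14 days. "
         "Parking is free.")
compressed = compress_chunk("How long do refunds take?", chunk)
print(compressed)  # Refunds are issued within 14 days.
```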
Multi-hop Retrieval
For complex questions that require information from multiple documents, perform multiple rounds of retrieval. The first retrieval provides context that informs a second, more targeted retrieval.
Self-RAG
The model decides whether it needs to retrieve at all. For simple factual questions, it might answer from its training data. For questions requiring current or specific information, it triggers retrieval. This is more efficient and avoids unnecessary retrieval overhead.
Implementation Overview
Here's what you need to build a RAG system:
Core Components
- Document loader: Reads your documents (PDF, HTML, Markdown, etc.)
- Text splitter: Chunks documents into passages
- Embedding model: Converts text to vectors
- Vector store: Stores and searches vectors
- LLM: Generates the final response
- Orchestration framework (optional): LangChain, LlamaIndex, or custom code
Popular Frameworks
| Framework | Strengths | Best For |
|---|---|---|
| LangChain | Extensive integrations, flexible | Complex pipelines, many integrations |
| LlamaIndex | Data-focused, great indexing | Data-heavy RAG applications |
| Haystack | Pipeline-based, production-ready | Enterprise search systems |
| Custom (API + SDK) | Full control, minimal dependencies | Simple systems, learning |
For a complete LangChain tutorial, see our LangChain Tutorial.
Code Example: Simple RAG Pipeline
Here's a minimal RAG implementation using Python, showing the core concepts without framework abstractions:
```python
import numpy as np
from openai import OpenAI

# The client reads the API key from the OPENAI_API_KEY environment variable.
client = OpenAI()

# --- Step 1: Index documents (offline) ---
documents = [
    "The Transformer architecture uses self-attention to process sequences in parallel.",
    "BERT is an encoder-only model trained with masked language modeling.",
    "GPT is a decoder-only model trained to predict the next token.",
    "LoRA fine-tunes models by adding low-rank decomposition matrices.",
    "RAG retrieves relevant documents to augment LLM responses.",
]

# Embed all documents
response = client.embeddings.create(
    input=documents,
    model="text-embedding-3-small"
)
doc_vectors = np.array([e.embedding for e in response.data])


# --- Step 2: Retrieve (online) ---
def retrieve(query, top_k=3):
    """Find the most relevant documents for a query."""
    # Embed the query
    q_response = client.embeddings.create(
        input=[query],
        model="text-embedding-3-small"
    )
    q_vector = np.array(q_response.data[0].embedding)

    # Cosine similarity
    similarities = np.dot(doc_vectors, q_vector) / (
        np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q_vector)
    )

    # Get top-k indices
    top_indices = np.argsort(similarities)[-top_k:][::-1]
    return [(documents[i], similarities[i]) for i in top_indices]


# --- Step 3: Generate with context (online) ---
def rag_answer(question):
    """Answer a question using RAG."""
    # Retrieve relevant documents
    results = retrieve(question, top_k=3)
    context = "\n".join([doc for doc, score in results])

    # Generate answer with context
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": (
                "Answer based on the provided context. "
                "If the context doesn't contain the answer, say so."
            )},
            {"role": "user", "content": (
                f"Context:\n{context}\n\n"
                f"Question: {question}"
            )}
        ]
    )
    return response.choices[0].message.content


# Usage
answer = rag_answer("How does the Transformer process sequences?")
print(answer)
```
This simplified example demonstrates the core RAG pattern: embed documents, retrieve by similarity, and generate with context. Production systems add chunking strategies, reranking, hybrid search, and error handling on top of this foundation.
Frequently Asked Questions
What is RAG in simple terms?
RAG (Retrieval-Augmented Generation) is a technique where an LLM first retrieves relevant information from a knowledge base, then uses that information to generate its answer. Instead of relying solely on what the model memorized during training, it looks up current, specific information before responding.
What is the difference between RAG and fine-tuning?
Fine-tuning modifies the model's weights, so whatever it learns is baked in until the model is retrained. RAG keeps the model unchanged and instead provides relevant context at query time. RAG is better for frequently changing information and when you need source attribution. Fine-tuning is better for changing the model's behavior, style, or format.
What is a vector database and why is it needed for RAG?
A vector database stores text as numerical vectors (embeddings) and enables fast similarity search. In RAG, documents are converted to vectors and stored. When a query comes in, it's also converted to a vector, and the database finds the most similar documents. Popular vector databases include Pinecone, Weaviate, Chroma, and Qdrant.
Can RAG completely eliminate LLM hallucinations?
RAG significantly reduces hallucinations by grounding responses in retrieved facts, but it doesn't eliminate them completely. The LLM can still misinterpret retrieved context, combine information incorrectly, or generate claims not supported by the retrieved documents. Techniques like citation enforcement and faithfulness checking help further reduce hallucinations.