# RAG (Retrieval-Augmented Generation)

Retrieval-augmented generation: retrieve relevant context before generating.

## Overview
RAG grounds LLM responses in external knowledge by retrieving relevant documents before generating a response. Instead of relying solely on the LLM's training data, the system searches a knowledge base and injects the most relevant content into the prompt.
Evolves from: Parallel Calls — adds a retrieval step, context injection, and relevance filtering.
## Architecture
Figure: The query is embedded and used to search a document store. Retrieved chunks are filtered for relevance, injected into the prompt, and the LLM generates a grounded response.
## How It Works
Ingestion (offline):
- Load documents from your knowledge source
- Chunk documents into retrieval-sized pieces (typically 200–1000 tokens)
- Embed each chunk into a vector representation
- Store vectors in a searchable index (vector database)
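The chunking step above can be sketched as a simple overlapping splitter. This is a toy sketch: it counts words rather than tokens (a real pipeline would measure the 200–1000-token target with the embedding model's tokenizer), and the `chunk_size`/`overlap` defaults are illustrative, not recommendations:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping word-based chunks.

    Sizes are in words for simplicity; production systems count tokens
    with the model's tokenizer. Overlap preserves context that would
    otherwise be cut at chunk boundaries. Requires overlap < chunk_size.
    """
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(words):
            break
    return chunks
```

Each chunk is then embedded and stored alongside its metadata; the overlap means a sentence straddling a boundary still appears whole in at least one chunk.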
Query (online):
- Embed the user's query using the same embedding model
- Search the vector store for the most similar chunks (top-K)
- Filter results for relevance (similarity threshold, metadata filters)
- Augment the LLM prompt with the retrieved context
- Generate a response grounded in the retrieved documents
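The query-time steps above can be sketched with a toy in-memory index. The `retrieve` and `build_prompt` helpers are hypothetical names for illustration; a real pipeline would use a vector database for search and pass the assembled prompt to the LLM client:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec, index, top_k=3, min_score=0.2):
    """Top-K search plus a similarity-threshold relevance filter.

    index: list of (chunk_text, vector) pairs.
    Returns (score, chunk_text) pairs, best first.
    """
    scored = [(cosine(query_vec, vec), text) for text, vec in index]
    scored.sort(reverse=True)
    return [(s, t) for s, t in scored[:top_k] if s >= min_score]

def build_prompt(question: str, results) -> str:
    """Augment the prompt with retrieved chunks, numbered for attribution."""
    context = "\n\n".join(f"[{i + 1}] {text}" for i, (_, text) in enumerate(results))
    return (
        "Answer using only the context below. Cite sources as [n].\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

The `min_score` threshold implements the relevance-filtering step: chunks that rank in the top K but are still dissimilar to the query are dropped rather than injected as noise.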
## Minimal Example
Answer HR policy questions from a company handbook — retrieval ensures answers are grounded in actual policy, not LLM training data.
```python
from patterns.rag.code.python.rag import RAGPipeline

pipeline = RAGPipeline(
    llm=your_llm,
    embedder=your_embedder,
    top_k=3,
    chunk_size=500,
)

# Ingestion — run once when documents are added or updated
n_chunks = pipeline.ingest(
    documents=company_handbook_pages,
    metadata=[{"source": "handbook", "section": s} for s in section_names],
)
print(f"Indexed {n_chunks} chunks")

# Query — at request time
result = pipeline.query("What is the process for requesting parental leave?")
# result.answer → answer grounded in retrieved context
# result.chunks_used → the specific handbook sections retrieved
# result.query → original question (for logging / evaluation)
```
Without RAG, the LLM would answer from training data — which may be outdated or simply wrong for your company's specific policy. With RAG, the answer is always sourced from your current documents.
Full implementation: [`code/python/rag.py`](code/python/rag.py)
## Input / Output
- Input: User query + document store (pre-indexed)
- Output: LLM response grounded in retrieved document content
- Retrieved context: Top-K document chunks most relevant to the query
- Ingestion input: Raw documents (text, PDF, HTML, etc.)
## Key Tradeoffs
| Strength | Limitation |
|---|---|
| Grounds responses in factual sources | Retrieval quality limits response quality |
| Reduces hallucination for knowledge-heavy tasks | Requires maintaining and indexing a document store |
| Knowledge can be updated without retraining | Chunking strategy significantly affects results |
| Works with any LLM — no fine-tuning needed | Retrieved context consumes context window tokens |
| Provides source attribution | Embedding quality affects search accuracy |
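One tradeoff above deserves a concrete check: retrieved context consumes context window tokens. A rough pre-flight budget guard can be sketched as follows (the ~0.75 words-per-token ratio and the 4000-token budget are illustrative assumptions; use the model's tokenizer for exact counts):

```python
def fit_to_budget(chunks: list[str], max_context_tokens: int = 4000,
                  words_per_token: float = 0.75) -> list[str]:
    """Greedily keep highest-ranked chunks until the estimated token
    budget is exhausted. Assumes chunks are ordered best-first.

    The word-count heuristic is approximate; production code should
    count tokens with the target model's tokenizer.
    """
    kept, used = [], 0
    for chunk in chunks:
        est = int(len(chunk.split()) / words_per_token)
        if used + est > max_context_tokens:
            break
        kept.append(chunk)
        used += est
    return kept
```

Trimming best-first means that when the budget is tight, the least relevant retrieved chunks are the ones dropped.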
## When to Use
- Question-answering over a specific knowledge base (docs, policies, code)
- When the LLM needs information not in its training data
- When responses must be grounded in specific source documents
- When you need source attribution ("answer based on document X, section Y")
- When knowledge changes frequently and fine-tuning isn't practical
## When NOT to Use
- When all needed information fits in the system prompt — just include it directly
- When the task doesn't require external knowledge (creative writing, reasoning)
- When real-time data is needed — RAG over a static index will be stale
- When exact database queries would be more appropriate — use Tool Use with a DB query tool
## Related Patterns
- Evolves from: Parallel Calls — see evolution.md
- Combines with: ReAct (agent decides when to retrieve), Memory (shared vector store for both documents and interaction history)
- Advanced form: Agentic RAG — the agent decides when, what, and how to retrieve, potentially reformulating queries or searching multiple sources
## Deeper Dive
- Design — Chunking strategies, embedding selection, retrieval tuning, relevance filtering, re-ranking
- Implementation — Pseudocode, ingestion pipeline, query pipeline, testing with fixtures
- Evolution — How RAG evolves from parallel calls