# RAG (Retrieval-Augmented Generation)

Retrieval-augmented generation: retrieve relevant context before generating.

## Overview
RAG grounds LLM responses in external knowledge by retrieving relevant documents before generating a response. Instead of relying solely on the LLM's training data, the system searches a knowledge base and injects the most relevant content into the prompt.
Evolves from: Parallel Calls — adds a retrieval step, context injection, and relevance filtering.
## Architecture
Figure: The query is embedded and used to search a document store. Retrieved chunks are filtered for relevance, injected into the prompt, and the LLM generates a grounded response.
## How It Works
Ingestion (offline):
- Load documents from your knowledge source
- Chunk documents into retrieval-sized pieces (typically 200–1000 tokens)
- Embed each chunk into a vector representation
- Store vectors in a searchable index (vector database)
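The chunking step above can be sketched as a simple overlapping splitter. This is a toy sketch: it counts words rather than tokens (a real pipeline would measure the 200–1000-token target with the embedding model's tokenizer), and the `chunk_size`/`overlap` defaults are illustrative, not recommendations:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping word-based chunks.

    Sizes are in words for simplicity; production systems count tokens
    with the model's tokenizer. Overlap preserves context that would
    otherwise be cut at chunk boundaries. Requires overlap < chunk_size.
    """
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(words):
            break
    return chunks
```

Each chunk is then embedded and stored alongside its metadata; the overlap means a sentence straddling a boundary still appears whole in at least one chunk.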
Query (online):
- Embed the user's query using the same embedding model
- Search the vector store for the most similar chunks (top-K)
- Filter results for relevance (similarity threshold, metadata filters)
- Augment the LLM prompt with the retrieved context
- Generate a response grounded in the retrieved documents
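The query-time steps above can be sketched with a toy in-memory index. The `retrieve` and `build_prompt` helpers are hypothetical names for illustration; a real pipeline would use a vector database for search and pass the assembled prompt to the LLM client:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec, index, top_k=3, min_score=0.2):
    """Top-K search plus a similarity-threshold relevance filter.

    index: list of (chunk_text, vector) pairs.
    Returns (score, chunk_text) pairs, best first.
    """
    scored = [(cosine(query_vec, vec), text) for text, vec in index]
    scored.sort(reverse=True)
    return [(s, t) for s, t in scored[:top_k] if s >= min_score]

def build_prompt(question: str, results) -> str:
    """Augment the prompt with retrieved chunks, numbered for attribution."""
    context = "\n\n".join(f"[{i + 1}] {text}" for i, (_, text) in enumerate(results))
    return (
        "Answer using only the context below. Cite sources as [n].\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

The `min_score` threshold implements the relevance-filtering step: chunks that rank in the top K but are still dissimilar to the query are dropped rather than injected as noise.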
## Minimal Example
Answer HR policy questions from a company handbook — retrieval ensures answers are grounded in actual policy, not LLM training data.
```python
from patterns.rag.code.python.rag import RAGPipeline

pipeline = RAGPipeline(
    llm=your_llm,
    embedder=your_embedder,
    top_k=3,
    chunk_size=500,
)

# Ingestion — run once when documents are added or updated
n_chunks = pipeline.ingest(
    documents=company_handbook_pages,
    metadata=[{"source": "handbook", "section": s} for s in section_names],
)
print(f"Indexed {n_chunks} chunks")

# Query — at request time
result = pipeline.query("What is the process for requesting parental leave?")
# result.answer → answer grounded in retrieved context
# result.chunks_used → the specific handbook sections retrieved
# result.query → original question (for logging / evaluation)
```
Without RAG, the LLM would answer from training data — which may be outdated or simply wrong for your company's specific policy. With RAG, the answer is always sourced from your current documents.
Full implementation: [`code/python/rag.py`](code/python/rag.py)
## Input / Output
- Input: User query + document store (pre-indexed)
- Output: LLM response grounded in retrieved document content
- Retrieved context: Top-K document chunks most relevant to the query
- Ingestion input: Raw documents (text, PDF, HTML, etc.)
## Key Tradeoffs
| Strength | Limitation |
|---|---|
| Grounds responses in factual sources | Retrieval quality limits response quality |
| Reduces hallucination for knowledge-heavy tasks | Requires maintaining and indexing a document store |
| Knowledge can be updated without retraining | Chunking strategy significantly affects results |
| Works with any LLM — no fine-tuning needed | Retrieved context consumes context window tokens |
| Provides source attribution | Embedding quality affects search accuracy |
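One tradeoff above deserves a concrete check: retrieved context consumes context window tokens. A rough pre-flight budget guard can be sketched as follows (the ~0.75 words-per-token ratio and the 4000-token budget are illustrative assumptions; use the model's tokenizer for exact counts):

```python
def fit_to_budget(chunks: list[str], max_context_tokens: int = 4000,
                  words_per_token: float = 0.75) -> list[str]:
    """Greedily keep highest-ranked chunks until the estimated token
    budget is exhausted. Assumes chunks are ordered best-first.

    The word-count heuristic is approximate; production code should
    count tokens with the target model's tokenizer.
    """
    kept, used = [], 0
    for chunk in chunks:
        est = int(len(chunk.split()) / words_per_token)
        if used + est > max_context_tokens:
            break
        kept.append(chunk)
        used += est
    return kept
```

Trimming best-first means that when the budget is tight, the least relevant retrieved chunks are the ones dropped.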
## When to Use
- Question-answering over a specific knowledge base (docs, policies, code)
- When the LLM needs information not in its training data
- When responses must be grounded in specific source documents
- When you need source attribution ("answer based on document X, section Y")
- When knowledge changes frequently and fine-tuning isn't practical
## When NOT to Use
- When all needed information fits in the system prompt — just include it directly
- When the task doesn't require external knowledge (creative writing, reasoning)
- When real-time data is needed — RAG over a static index will be stale
- When exact database queries would be more appropriate — use Tool Use with a DB query tool
## Related Patterns
- Evolves from: Parallel Calls — see evolution.md
- Combines with: ReAct (agent decides when to retrieve), Memory (shared vector store for both documents and interaction history)
- Advanced form: Agentic RAG — the agent decides when, what, and how to retrieve, potentially reformulating queries or searching multiple sources
## Deeper Dive
- Design — Chunking strategies, embedding selection, retrieval tuning, relevance filtering, re-ranking
- Implementation — Pseudocode, ingestion pipeline, query pipeline, testing with fixtures
- Evolution — How RAG evolves from parallel calls