Agent Pattern

RAG

Retrieval-augmented generation: retrieve relevant context before generating.

Intermediate Evolves from: Parallel Calls →

RAG (Retrieval-Augmented Generation) — Overview

RAG grounds LLM responses in external knowledge by retrieving relevant documents before generating a response. Instead of relying solely on the LLM's training data, the system searches a knowledge base and injects the most relevant content into the prompt.

Evolves from: Parallel Calls — adds a retrieval step, context injection, and relevance filtering.

Architecture

Figure: The query is embedded and used to search a document store. Retrieved chunks are filtered for relevance, injected into the prompt, and the LLM generates a grounded response.

How It Works

Ingestion (offline):

Load documents from your knowledge source
Chunk documents into retrieval-sized pieces (typically 200–1000 tokens)
Embed each chunk into a vector representation
Store vectors in a searchable index (vector database)

Query (online):

Embed the user's query using the same embedding model
Search the vector store for the most similar chunks (top-K)
Filter results for relevance (similarity threshold, metadata filters)
Augment the LLM prompt with the retrieved context
Generate a response grounded in the retrieved documents

Minimal Example

Answer HR policy questions from a company handbook — retrieval ensures answers are grounded in actual policy, not LLM training data.

from patterns.rag.code.python.rag import RAGPipeline

pipeline = RAGPipeline(
    llm=your_llm,
    embedder=your_embedder,
    top_k=3,
    chunk_size=500,
)

# Ingestion — run once when documents are added or updated
n_chunks = pipeline.ingest(
    documents=company_handbook_pages,
    metadata=[{"source": "handbook", "section": s} for s in section_names],
)
print(f"Indexed {n_chunks} chunks")

# Query — at request time
result = pipeline.query("What is the process for requesting parental leave?")
# result.answer       → answer grounded in retrieved context
# result.chunks_used  → the specific handbook sections retrieved
# result.query        → original question (for logging / evaluation)

Without RAG, the LLM would answer from training data — which may be outdated or simply wrong for your company's specific policy. With RAG, the answer is always sourced from your current documents.

Code variants

Implementation	Language	Path
Framework-agnostic pipeline (MockLLM + MockEmbedder)	Python	`code/python/rag.py`
Pydantic AI (`Agent` with typed `RAGAnswer` result + retrieval-as-tool)	Python	`code/python/pydantic-ai/rag.py`
LangGraph (`StateGraph`: retrieve node → generate node)	Python	`code/python/langgraph/rag.py`
Vercel AI SDK (retrieve outside `generateText`, inline context)	TypeScript	`code/typescript/vercel-ai-sdk/rag.ts`

The framework-specific files share an identical task (ingest three short documents about ReAct / RAG / Plan-and-Execute; query for RAG) so they're diff-friendly across stacks. Both use a deterministic hash-based mock embedder so the chunk scores are reproducible offline; swap in a real embedding provider (OpenAI, Voyage, Cohere) to ship.

Input / Output

Input: User query + document store (pre-indexed)
Output: LLM response grounded in retrieved document content
Retrieved context: Top-K document chunks most relevant to the query
Ingestion input: Raw documents (text, PDF, HTML, etc.)

Key Tradeoffs

Strength	Limitation
Grounds responses in factual sources	Retrieval quality limits response quality
Reduces hallucination for knowledge-heavy tasks	Requires maintaining and indexing a document store
Knowledge can be updated without retraining	Chunking strategy significantly affects results
Works with any LLM — no fine-tuning needed	Retrieved context consumes context window tokens
Provides source attribution	Embedding quality affects search accuracy

When to Use

Question-answering over a specific knowledge base (docs, policies, code)
When the LLM needs information not in its training data
When responses must be grounded in specific source documents
When you need source attribution ("answer based on document X, section Y")
When knowledge changes frequently and fine-tuning isn't practical

When NOT to Use

When all needed information fits in the system prompt — just include it directly
When the task doesn't require external knowledge (creative writing, reasoning)
When real-time data is needed — RAG over a static index will be stale
When exact database queries would be more appropriate — use Tool Use with a DB query tool

Evolves from: Parallel Calls — see evolution.md
Combines with: ReAct (agent decides when to retrieve), Memory (shared vector store for both documents and interaction history)
Advanced form: Agentic RAG — the agent decides when, what, and how to retrieve, potentially reformulating queries or searching multiple sources

Deeper Dive

Design — Chunking strategies, embedding selection, retrieval tuning, relevance filtering, re-ranking
Implementation — Pseudocode, ingestion pipeline, query pipeline, testing with fixtures
Evolution — How RAG evolves from parallel calls

When NOT to use this pattern

All needed knowledge fits in the context window — context-stuffing is simpler.
Retrieval recall is unmeasured — you'll ship hallucinations grounded in irrelevant chunks.
The corpus is small and changes rarely — a flat document in the system prompt may suffice.

Next steps

Production version: see Blueprints → Deployments for the deployment agents that use this pattern.
Generate a starter project: see Blueprint → Spec → Scaffold.
Combine with other patterns: see the Composition guide.

Anti-compositions to watch for

Documented pairings where this pattern often fails in production. Source: composition/anti-compositions.

avoid· paired with ReAct

RAG + ReAct without retrieval grounding

RAG's value is grounding. Without citation enforcement, the agent treats retrieved chunks as suggestions and may cite them selectively to support claims they do not actually support. You pay retrieval cost without getting the hallucination defense.

Instead:Schema-enforce citations (claims must cite a retrieved span) or drop RAG and use the agent's parametric knowledge.

caution· paired with Memory

Memory + RAG over the same corpus and namespace

Memory is time-ordered, agent-owned context. RAG is topic-organized, externally-authored knowledge. Indexing them identically loses the type distinction and duplicates retrieval logic.

Instead:Same vector store infrastructure is fine; use separate namespaces with separate retrieval logic. Memory retrieval is recency-weighted, RAG retrieval is similarity-weighted.

RAG