Reflection
An LLM critiques its own output and improves it through structured feedback.
Reflection (Self-Critique) — Overview
Reflection enables an agent to evaluate and improve its own output by generating self-critique, then revising based on that critique. Unlike the Evaluator-Optimizer workflow where evaluation is external, reflection is self-directed — the agent identifies its own weaknesses.
Evolves from: Evaluator-Optimizer — adds self-generated critique, richer self-awareness, and adaptive refinement strategies.
Architecture
Figure: The agent generates output, reflects on its quality, and revises until the critique is satisfactory or the iteration budget is exhausted.
How It Works
- Generate — The LLM produces an initial output based on the task.
- Reflect — The same (or different) LLM critiques the output against quality criteria. The critique is specific: what's wrong, what's missing, what could be better.
- Decide — If the critique indicates the output is acceptable, return it. If not, continue.
- Revise — The LLM revises its output using the critique as guidance. The revision prompt includes both the original output and the specific critique.
- Repeat — The revised output goes through reflection again. This continues until quality is satisfactory or the iteration limit is reached.
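The five steps above can be sketched as a plain loop. This is a minimal illustration, not the repository's `ReflectionAgent`: the `reflect` function, the `Critique` and `Step` types, and the three callables (`generate`, `critique`, `revise`) are hypothetical names standing in for whatever LLM calls you use.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Critique:
    passed: bool                                  # True when the critic accepts the draft
    issues: List[str] = field(default_factory=list)


@dataclass
class Step:
    draft: str
    critique: Critique


def reflect(
    task: str,
    generate: Callable[[str], str],               # hypothetical: task -> draft
    critique: Callable[[str, str], Critique],     # hypothetical: (task, draft) -> Critique
    revise: Callable[[str, str, Critique], str],  # hypothetical: (task, draft, critique) -> draft
    max_iterations: int = 3,
) -> List[Step]:
    """Generate, reflect, decide, revise; stop when accepted or out of budget."""
    if max_iterations < 1:
        raise ValueError("max_iterations must be at least 1")
    steps: List[Step] = []
    draft = generate(task)                        # Generate
    for i in range(max_iterations):
        verdict = critique(task, draft)           # Reflect
        steps.append(Step(draft, verdict))
        if verdict.passed:                        # Decide
            break
        if i < max_iterations - 1:
            draft = revise(task, draft, verdict)  # Revise, then Repeat
    return steps
```

In this sketch every draft is critiqued before it is recorded, so `steps[-1].draft` is the final output and `steps[-1].critique.passed` says whether it was accepted or the budget simply ran out.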
The key difference from Evaluator-Optimizer: in reflection, the critique is self-generated and often richer — the LLM reasons about its own output's strengths and weaknesses, rather than just producing a score.
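One way to keep the critique richer than a bare score is to ask the critic for structured fields and parse them. The sketch below assumes the critic replies with a JSON object; the field names (`strengths`, `issues`, `suggestion`, `acceptable`) and the `parse_critique` helper are assumptions for illustration, not the format used by the repository's implementation.

```python
import json
from dataclasses import dataclass
from typing import Any, List


@dataclass
class StructuredCritique:
    strengths: List[str]
    issues: List[str]
    suggestion: str
    acceptable: bool


def _str_list(value: Any) -> List[str]:
    """Coerce a JSON value to a list of strings; anything else becomes []."""
    return [str(item) for item in value] if isinstance(value, list) else []


def parse_critique(raw: str) -> StructuredCritique:
    """Parse a critic reply that is expected to be a JSON object.

    A malformed reply is treated as "not acceptable", so a parsing
    failure can never silently approve a draft.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = None
    if not isinstance(data, dict):
        return StructuredCritique([], ["Critic reply was not a JSON object"], "", False)

    issues = _str_list(data.get("issues"))
    return StructuredCritique(
        strengths=_str_list(data.get("strengths")),
        issues=issues,
        suggestion=str(data.get("suggestion", "")),
        # Accept only when the critic says so AND lists no open issues.
        acceptable=data.get("acceptable") is True and not issues,
    )
```

Requiring both an explicit `acceptable: true` and an empty `issues` list is a deliberate design choice here: it guards against a critic that approves a draft while still listing problems.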
Minimal Example
Write a beginner-friendly technical explanation, then self-critique and revise until all criteria are met.
```python
from patterns.reflection.code.python.reflection import ReflectionAgent

# your_llm is a placeholder: pass in your own LLM client here.
agent = ReflectionAgent(
    llm=your_llm,
    criteria="""
    - Technically accurate: no invented APIs or incorrect statements
    - Includes a concrete, runnable code example
    - No unexplained jargon
    - Under 250 words
    """,
    max_iterations=3,
)

result = agent.run("Write a beginner-friendly explanation of database indexing")

# result.passed                 → True if all criteria were met within max_iterations
# result.iterations[i].draft    → the generated draft at iteration i
# result.iterations[i].critique → self-generated critique (issues + suggestion)
# result.iterations[i].passed   → whether the critic approved this version
# result.final_output           → the self-improved final version
```
Why use Reflection instead of a single prompt? A single prompt asks the LLM to write and judge its output in the same pass, and those two objectives compete. Separating generation from critique with distinct prompts tends to produce higher-quality results, at the cost of extra LLM calls.
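To make the separation concrete, here is a sketch of three prompt builders, one per role. The function names and the prompt wording are illustrative assumptions, not the prompts shipped in `code/python/reflection.py`.

```python
from typing import List


def generation_prompt(task: str) -> str:
    """Prompt for the writer role: produce a draft, nothing else."""
    return f"Complete the following task.\n\nTask: {task}"


def critique_prompt(task: str, draft: str, criteria: List[str]) -> str:
    """Prompt for the critic role: judge the draft, do not rewrite it."""
    bullet_list = "\n".join(f"- {criterion}" for criterion in criteria)
    return (
        "Review the draft below against each criterion. "
        "List what is wrong, what is missing, and what could be better. "
        "Do not rewrite the draft.\n\n"
        f"Task: {task}\n\nCriteria:\n{bullet_list}\n\nDraft:\n{draft}"
    )


def revision_prompt(task: str, draft: str, critique: str) -> str:
    """Prompt for the revision step: carries both the draft and the critique."""
    return (
        "Revise the draft so that it addresses every point in the critique.\n\n"
        f"Task: {task}\n\nDraft:\n{draft}\n\nCritique:\n{critique}"
    )
```

The writer prompt never sees the criteria list in this sketch, and the critic prompt is told not to rewrite, so each call has a single job.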
Full implementation: [`code/python/reflection.py`](code/python/reflection.py)
Input / Output
- Input: A task requiring high-quality output
- Output: Refined output that has survived self-critique
- Critique: Specific feedback identifying strengths, weaknesses, and improvement suggestions
- Revision: Updated output addressing the critique
Key Tradeoffs
| Strength | Limitation |
|---|---|
| Self-improving — catches its own mistakes | 2+ LLM calls per iteration (expensive) |
| Richer feedback than numeric scoring | LLMs have blind spots — may not catch all errors |
| No external evaluator needed | Can over-optimize, losing the original intent |
| Builds a chain of reasoning about quality | Diminishing returns after 2–3 iterations |
| Works for any generation task | Self-critique can be overconfident or miss systematic biases |
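Because returns diminish after a few iterations, a loop can stop early once a revision barely changes the draft. The following is a minimal sketch using the standard library's `difflib`; the `has_converged` name and the 0.95 threshold are assumptions chosen for illustration, not values taken from the repository.

```python
from difflib import SequenceMatcher


def has_converged(previous: str, current: str, threshold: float = 0.95) -> bool:
    """Return True when two consecutive drafts are nearly identical.

    SequenceMatcher.ratio() is 1.0 for identical strings and approaches
    0.0 as they share less text.
    """
    similarity = SequenceMatcher(None, previous, current).ratio()
    return similarity >= threshold
```

A check like this would typically run alongside the critic's verdict rather than replace it: similarity only says the draft stopped changing, not that it is good.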
When to Use
- High-stakes content generation (reports, code, analysis) where quality matters
- When you can define clear quality criteria for the LLM to evaluate against
- When external evaluation is unavailable or impractical
- Tasks where iterative refinement naturally improves quality (writing, code review)
- When you want an audit trail of improvements (critique chain)
When NOT to Use
- When first-pass quality is sufficient — reflection doubles the cost at minimum
- When latency is critical — each iteration adds a full round-trip
- When you have a reliable external evaluator — use Evaluator-Optimizer instead
- For factual retrieval tasks — reflection can't fix missing knowledge; use RAG
Related Patterns
- Evolves from: Evaluator-Optimizer — see evolution.md
- Combines with: ReAct (reflect on tool call results), Plan & Execute (reflect on plan quality before execution)
- Simpler alternative: Evaluator-Optimizer (when a score + feedback loop is sufficient)
Deeper Dive
- Design — Critique prompt design, revision strategies, convergence detection, quality criteria
- Implementation — Pseudocode, reflection prompts, iteration management, testing
- Evolution — How reflection evolves from evaluator-optimizer