Agent Pattern

Reflection

LLM critiques its own output and self-improves through structured feedback.

Level: Intermediate. Evolves from: Evaluator-Optimizer

Reflection (Self-Critique) — Overview

Reflection lets an agent evaluate and improve its own output: it generates a critique of its draft, then revises the draft based on that critique. Unlike the Evaluator-Optimizer workflow, where evaluation is external, reflection is self-directed: the agent identifies its own weaknesses.

Evolves from: Evaluator-Optimizer — adds self-generated critique, richer self-awareness, and adaptive refinement strategies.

Architecture

graph TD
    Input([Task]) -->|"goal"| Generate[Generate:<br/>Produce initial output]
    Generate -->|"candidate"| Reflect[Reflect:<br/>Self-critique]
    Reflect -->|"critique + score"| Decide{Good enough?}
    Decide -->|"No"| Revise[Revise:<br/>Improve based on critique]
    Revise -->|"improved output"| Reflect
    Decide -->|"Yes"| Output([Final Output])
    Guard[/"Max Iterations"/] -.->|"budget exceeded"| Output
    style Input fill:#e3f2fd
    style Generate fill:#fff3e0
    style Reflect fill:#e8f5e9
    style Decide fill:#fce4ec
    style Revise fill:#fff3e0
    style Output fill:#e3f2fd
    style Guard fill:#fff8e1

Figure: The agent generates output, reflects on its quality, and revises until the critique is satisfactory or the iteration budget is exhausted.
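
The Decide node and the Max Iterations guard in the figure can be expressed as one small stopping rule. The sketch below is illustrative only: the function name `stop_reason`, the 0 to 1 score scale, the default threshold, and the `min_gain` stall check are all assumptions, not part of the pattern's reference implementation.

```python
from typing import List, Optional


def stop_reason(
    scores: List[float],
    threshold: float = 0.8,     # assumed 0..1 score scale
    max_iterations: int = 3,
    min_gain: float = 0.02,     # assumed: smaller gains count as a stall
) -> Optional[str]:
    """Return why the loop should stop, or None if it should revise again.

    `scores` holds one critique score per reflection so far, oldest first.
    """
    if not scores:
        return None                          # nothing reviewed yet
    if scores[-1] >= threshold:
        return "accepted"                    # Decide -> Yes
    if len(scores) >= max_iterations:
        return "budget exceeded"             # Max Iterations guard
    if len(scores) >= 2 and scores[-1] - scores[-2] < min_gain:
        return "stalled"                     # revisions are no longer helping
    return None                              # Decide -> No, revise again
```

Checking acceptance before the budget means a draft that passes on the final allowed reflection is still reported as accepted, not as a budget failure.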

How It Works

1. Generate: The LLM produces an initial output based on the task.
2. Reflect: The same LLM (or a different one) critiques the output against the quality criteria. The critique is specific: what is wrong, what is missing, and what could be better.
3. Decide: If the critique indicates the output is acceptable, return it. If not, continue.
4. Revise: The LLM revises its output using the critique as guidance. The revision prompt includes both the original output and the specific critique.
5. Repeat: The revised output goes through reflection again. This continues until quality is satisfactory or the iteration limit is reached.
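
The steps above can be sketched as a short loop. This is a minimal illustration, not the repository's implementation: `Critique`, `reflect_loop`, and the three callables are hypothetical names, and the stub functions at the bottom stand in for real LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Critique:
    passed: bool     # did the critic accept the draft?
    feedback: str    # specific issues and suggestions


def reflect_loop(
    task: str,
    generate: Callable[[str], str],
    reflect: Callable[[str, str], Critique],
    revise: Callable[[str, str, str], str],
    max_iterations: int = 3,
) -> Tuple[str, List[Critique]]:
    """Generate, then reflect/decide/revise until accepted or out of budget."""
    draft = generate(task)                                # Generate
    history: List[Critique] = []
    for _ in range(max_iterations):
        critique = reflect(task, draft)                   # Reflect
        history.append(critique)
        if critique.passed:                               # Decide
            break
        # Revise: the reviser sees both the draft and the critique.
        # If the budget runs out here, this last revision is returned
        # without being critiqued again.
        draft = revise(task, draft, critique.feedback)
    return draft, history


# Stubs standing in for LLM calls (assumptions for the demo only).
def stub_generate(task: str) -> str:
    return "An index lets the database find rows without scanning the table."


def stub_reflect(task: str, draft: str) -> Critique:
    ok = "example" in draft
    return Critique(ok, "" if ok else "Add a concrete example.")


def stub_revise(task: str, draft: str, feedback: str) -> str:
    return draft + " For example, an index on email speeds up lookups by email."


final, history = reflect_loop(
    "Explain database indexing", stub_generate, stub_reflect, stub_revise
)
```

With these stubs the first reflection rejects the draft and the second accepts the revision, so `history` holds two critiques.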

The key difference from Evaluator-Optimizer: in reflection, the critique is self-generated and often richer — the LLM reasons about its own output's strengths and weaknesses, rather than just producing a score.

Minimal Example

The example below asks the agent to write a beginner-friendly technical explanation, then self-critique and revise until all criteria are met or max_iterations is reached.

from patterns.reflection.code.python.reflection import ReflectionAgent

agent = ReflectionAgent(
    llm=your_llm,
    criteria="""
    - Technically accurate — no invented APIs or incorrect statements
    - Includes a concrete, runnable code example
    - No unexplained jargon
    - Under 250 words
    """,
    max_iterations=3,
)

result = agent.run("Write a beginner-friendly explanation of database indexing")
# result.passed              → True if all criteria were met before max_iterations
# result.iterations[i].draft    → the generated draft at iteration i
# result.iterations[i].critique → self-generated critique (issues + suggestion)
# result.iterations[i].passed   → whether the critic approved this version
# result.final_output        → the self-improved final version

Why use Reflection instead of a single prompt? A single prompt asks the LLM to write and judge its output at the same time, and those two objectives pull against each other. Separating generation from critique with distinct prompts tends to produce higher-quality results, at the cost of extra LLM calls.
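
To make "distinct prompts" concrete, here is one possible set of templates for the three roles. The wording, the `VERDICT:` convention, and the `parse_verdict` helper are assumptions for illustration; the actual prompts in the full implementation may look different.

```python
# Hypothetical prompt templates: one per role, filled with str.format().
GENERATE_PROMPT = "Task: {task}\n\nWrite the best response you can."

CRITIQUE_PROMPT = (
    "You are reviewing a draft. Do not rewrite it.\n"
    "Task: {task}\n\n"
    "Criteria:\n{criteria}\n\n"
    "Draft:\n{draft}\n\n"
    "List each criterion the draft fails and explain why. "
    "End with a single line: VERDICT: PASS or VERDICT: FAIL."
)

REVISE_PROMPT = (
    "Task: {task}\n\n"
    "Previous draft:\n{draft}\n\n"
    "Critique:\n{critique}\n\n"
    "Rewrite the draft so that it addresses every point in the critique."
)


def parse_verdict(critique_text: str) -> bool:
    """Return True only if the last non-empty line is 'VERDICT: PASS'."""
    lines = [ln.strip() for ln in critique_text.splitlines() if ln.strip()]
    return bool(lines) and lines[-1].upper() == "VERDICT: PASS"


critique_request = CRITIQUE_PROMPT.format(
    task="Explain database indexing",
    criteria="- Under 250 words\n- No unexplained jargon",
    draft="An index is a B-tree.",
)
```

Treating anything other than an explicit pass as a failure keeps a malformed or truncated critique from being read as approval.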

Full implementation: [`code/python/reflection.py`](code/python/reflection.py)

Input / Output

Input: A task requiring high-quality output
Output: Refined output that has survived self-critique
Critique: Specific feedback identifying strengths, weaknesses, and improvement suggestions
Revision: Updated output addressing the critique
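
The fields shown in the Minimal Example suggest a result shape along these lines. This is a sketch under assumptions: the class names `Iteration` and `ReflectionResult`, the plain-string critique, and the `audit_trail` helper are hypothetical, and the classes in the full implementation may differ.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Iteration:
    draft: str       # the output reviewed in this iteration
    critique: str    # self-generated critique (simplified to plain text here)
    passed: bool     # whether the critic approved this draft


@dataclass
class ReflectionResult:
    final_output: str
    iterations: List[Iteration] = field(default_factory=list)

    @property
    def passed(self) -> bool:
        # True when the most recently reviewed draft was approved.
        return bool(self.iterations) and self.iterations[-1].passed


def audit_trail(result: ReflectionResult) -> str:
    """Render the critique chain as one line per iteration."""
    lines = []
    for i, it in enumerate(result.iterations, start=1):
        status = "PASS" if it.passed else "FAIL"
        lines.append(f"[{i}] {status}: {it.critique}")
    return "\n".join(lines)


result = ReflectionResult(
    final_output="draft v2",
    iterations=[
        Iteration("draft v1", "No code example.", False),
        Iteration("draft v2", "All criteria met.", True),
    ],
)
print(audit_trail(result))
# [1] FAIL: No code example.
# [2] PASS: All criteria met.
```

Keeping every draft and critique, not only the final output, is what makes the critique chain usable as an audit trail.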

Key Tradeoffs

| Strength | Limitation |
| --- | --- |
| Self-improving: catches its own mistakes | 2+ LLM calls per iteration (expensive) |
| Richer feedback than numeric scoring | LLMs have blind spots and may not catch all errors |
| No external evaluator needed | Can over-optimize, losing the original intent |
| Builds a chain of reasoning about quality | Diminishing returns after 2–3 iterations |
| Works for any generation task | Self-critique can be overconfident or miss systematic biases |
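
The cost limitation can be made concrete by counting calls. The helper below is a back-of-the-envelope sketch with a hypothetical name, `llm_calls`; it assumes one call to generate, one call per reflection, and one revision call after every rejected draft (including the last one when the budget runs out).

```python
def llm_calls(reflections: int, accepted: bool) -> int:
    """Count LLM calls for one run of the reflection loop.

    reflections: number of critique passes that were run (at least 1).
    accepted:    whether the last critique approved the draft.
    """
    if reflections < 1:
        raise ValueError("at least one reflection is required")
    # Every rejected draft triggers a revision; an accepted one does not.
    revisions = reflections - 1 if accepted else reflections
    return 1 + reflections + revisions


best_case = llm_calls(1, accepted=True)     # generate + one critique = 2
typical = llm_calls(3, accepted=True)       # 1 + 3 critiques + 2 revisions = 6
exhausted = llm_calls(3, accepted=False)    # 1 + 3 critiques + 3 revisions = 7
```

Even the best case uses two calls where a single prompt uses one, and each further iteration adds two more.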

When to Use

High-stakes content generation (reports, code, analysis) where quality matters
When you can define clear quality criteria for the LLM to evaluate against
When external evaluation is unavailable or impractical
Tasks where iterative refinement naturally improves quality (writing, code review)
When you want an audit trail of improvements (critique chain)

When NOT to Use

When first-pass quality is sufficient — reflection doubles the cost at minimum
When latency is critical — each iteration adds a full round-trip
When you have a reliable external evaluator — use Evaluator-Optimizer instead
For factual retrieval tasks — reflection can't fix missing knowledge; use RAG

Related Patterns

Evolves from: Evaluator-Optimizer (see evolution.md)
Combines with: ReAct (reflect on tool call results), Plan & Execute (reflect on plan quality before execution)
Simpler alternative: Evaluator-Optimizer (when a score + feedback loop is sufficient)

Deeper Dive

Design — Critique prompt design, revision strategies, convergence detection, quality criteria
Implementation — Pseudocode, reflection prompts, iteration management, testing
Evolution — How reflection evolves from evaluator-optimizer