Agent Pattern

Reflection

An LLM critiques its own output and improves it through structured feedback.

Intermediate | Evolves from: Evaluator-Optimizer

Reflection (Self-Critique) — Overview

Reflection enables an agent to evaluate and improve its own output by generating self-critique, then revising based on that critique. Unlike the Evaluator-Optimizer workflow where evaluation is external, reflection is self-directed — the agent identifies its own weaknesses.

Evolves from: Evaluator-Optimizer — adds self-generated critique, richer self-awareness, and adaptive refinement strategies.

Architecture

graph TD
    Input([Task]) -->|"goal"| Generate[Generate:<br/>Produce initial output]
    Generate -->|"candidate"| Reflect[Reflect:<br/>Self-critique]
    Reflect -->|"critique + score"| Decide{Good enough?}
    Decide -->|"No"| Revise[Revise:<br/>Improve based on critique]
    Revise -->|"improved output"| Reflect
    Decide -->|"Yes"| Output([Final Output])
    Guard[/"Max Iterations"/] -.->|"budget exceeded"| Output

    style Input fill:#e3f2fd
    style Generate fill:#fff3e0
    style Reflect fill:#e8f5e9
    style Decide fill:#fce4ec
    style Revise fill:#fff3e0
    style Output fill:#e3f2fd
    style Guard fill:#fff8e1

Figure: The agent generates output, reflects on its quality, and revises until the critique is satisfactory or the iteration budget is exhausted.

How It Works

  1. Generate — The LLM produces an initial output based on the task.
  2. Reflect — The same (or different) LLM critiques the output against quality criteria. The critique is specific: what's wrong, what's missing, what could be better.
  3. Decide — If the critique indicates the output is acceptable, return it. If not, continue.
  4. Revise — The LLM revises its output using the critique as guidance. The revision prompt includes both the original output and the specific critique.
  5. Repeat — The revised output goes through reflection again. This continues until quality is satisfactory or the iteration limit is reached.
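The five steps above can be sketched as a short loop. This is a minimal illustration, not the repository's implementation: the `LLM` type alias, the `reflect_loop` name, the prompt wording, and the `APPROVED` sentinel are all assumptions made for this example.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in: an "LLM" is any function from prompt text to reply text.
LLM = Callable[[str], str]


def reflect_loop(
    llm: LLM, task: str, criteria: str, max_iterations: int = 3
) -> Tuple[str, bool, List[str]]:
    """Return (final draft, approved?, critiques in order)."""
    # 1. Generate: produce the initial draft.
    draft = llm(f"Task: {task}\nWrite the best answer you can.")
    critiques: List[str] = []

    for i in range(max_iterations):
        # 2. Reflect: critique the current draft against the criteria.
        critique = llm(
            f"Criteria:\n{criteria}\n\nDraft:\n{draft}\n\n"
            "List concrete problems, or reply with exactly APPROVED."
        )
        critiques.append(critique)

        # 3. Decide: stop as soon as the critic approves.
        if critique.strip() == "APPROVED":
            return draft, True, critiques

        # Budget guard: skip a final revision that could never be re-checked.
        if i == max_iterations - 1:
            break

        # 4. Revise: the prompt carries both the draft and the critique.
        draft = llm(
            f"Task: {task}\n\nPrevious draft:\n{draft}\n\n"
            f"Critique:\n{critique}\n\n"
            "Rewrite the draft so that it addresses every point in the critique."
        )
        # 5. Repeat: the loop sends the revised draft back to reflection.

    return draft, False, critiques
```

One design choice worth noting: when the budget runs out, this sketch returns the last draft that was actually critiqued rather than an unverified revision. Returning the unchecked revision instead is an equally defensible choice.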

The key difference from Evaluator-Optimizer: in reflection, the critique is self-generated and often richer — the LLM reasons about its own output's strengths and weaknesses, rather than just producing a score.
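To make the contrast with a bare score concrete, the sketch below models a critique as a small record and parses it from a JSON reply. The `Critique` class, its field names, and the JSON shape are hypothetical; the actual critique format used by the pattern's code may differ.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class Critique:
    """Hypothetical critique record; field names are illustrative only."""
    passed: bool
    strengths: List[str] = field(default_factory=list)
    issues: List[str] = field(default_factory=list)
    suggestion: str = ""


def parse_critique(raw: str) -> Critique:
    """Parse the critic's JSON reply; a malformed reply counts as a failed check."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return Critique(passed=False, issues=["critic reply was not valid JSON"])
    if not isinstance(data, dict):
        return Critique(passed=False, issues=["critic reply was not a JSON object"])

    issues = [str(item) for item in (data.get("issues") or [])]
    return Critique(
        # Accept a pass only when the critic also reported no open issues.
        passed=bool(data.get("passed", False)) and not issues,
        strengths=[str(item) for item in (data.get("strengths") or [])],
        issues=issues,
        suggestion=str(data.get("suggestion") or ""),
    )
```

Unlike a single number, the `issues` and `suggestion` fields give the revision step something specific to act on.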

Minimal Example

Write a beginner-friendly technical explanation, then self-critique and revise until all criteria are met.

from patterns.reflection.code.python.reflection import ReflectionAgent

agent = ReflectionAgent(
    llm=your_llm,  # placeholder: pass your own LLM client here
    criteria="""
    - Technically accurate — no invented APIs or incorrect statements
    - Includes a concrete, runnable code example
    - No unexplained jargon
    - Under 250 words
    """,
    max_iterations=3,
)

result = agent.run("Write a beginner-friendly explanation of database indexing")
# result.passed                 → True if all criteria were met before max_iterations
# result.iterations[i].draft    → the generated draft at iteration i
# result.iterations[i].critique → self-generated critique (issues + suggestion)
# result.iterations[i].passed   → whether the critic approved this version
# result.final_output           → the self-improved final version

Why use Reflection instead of a single prompt? A single prompt asks the LLM to write and judge its output at the same time, two objectives that pull against each other. Separating generation from critique into distinct prompts lets each call focus on one job, which tends to produce higher-quality results.

Full implementation: [`code/python/reflection.py`](code/python/reflection.py)
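As an illustration of keeping generation and critique in distinct prompts, the templates below give each call a single job. The wording, the constant names, and the helper functions are assumptions for this sketch and are not taken from `code/python/reflection.py`.

```python
from typing import List

# Hypothetical templates: one role per prompt.
GENERATE_PROMPT = (
    "You are a technical writer.\n"
    "Task: {task}\n"
    "Return only the finished text."
)

CRITIQUE_PROMPT = (
    "You are a strict reviewer. Do not rewrite the draft.\n"
    "Judge it against each criterion:\n{criteria}\n\n"
    "Draft:\n{draft}\n\n"
    "For each criterion answer PASS or FAIL with a one-line reason."
)

REVISE_PROMPT = (
    "Task: {task}\n\n"
    "Previous draft:\n{draft}\n\n"
    "Reviewer critique:\n{critique}\n\n"
    "Rewrite the draft so every FAIL is fixed. Keep what already passes."
)


def critique_prompt(draft: str, criteria: List[str]) -> str:
    """Render the reviewer prompt with one bullet per criterion."""
    bullets = "\n".join(f"- {item}" for item in criteria)
    return CRITIQUE_PROMPT.format(criteria=bullets, draft=draft)


def revise_prompt(task: str, draft: str, critique: str) -> str:
    """Render the revision prompt; it carries both the draft and the critique."""
    return REVISE_PROMPT.format(task=task, draft=draft, critique=critique)
```

The reviewer prompt explicitly forbids rewriting, so the critique call cannot quietly turn into a second generation call.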

Input / Output

  • Input: A task requiring high-quality output
  • Output: Refined output that has survived self-critique
  • Critique: Specific feedback identifying strengths, weaknesses, and improvement suggestions
  • Revision: Updated output addressing the critique
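The shapes listed above could be modeled roughly as follows. The field names mirror the comments in the minimal example (`draft`, `critique`, `passed`, `final_output`), but the class names and the derived properties are assumptions; the real types in the repository may be organized differently.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Iteration:
    """One pass through the loop: a draft plus the critique it received."""
    draft: str
    critique: str
    passed: bool


@dataclass
class ReflectionResult:
    """Hypothetical container for a whole run, oldest iteration first."""
    iterations: List[Iteration] = field(default_factory=list)

    @property
    def passed(self) -> bool:
        # The run passed only if the most recent draft was approved.
        return bool(self.iterations) and self.iterations[-1].passed

    @property
    def final_output(self) -> str:
        # The final output is the last draft produced, approved or not.
        return self.iterations[-1].draft if self.iterations else ""
```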

Key Tradeoffs

| Strength | Limitation |
|---|---|
| Self-improving: catches its own mistakes | 2+ LLM calls per iteration (expensive) |
| Richer feedback than numeric scoring | LLMs have blind spots and may not catch all errors |
| No external evaluator needed | Can over-optimize, losing the original intent |
| Builds a chain of reasoning about quality | Diminishing returns after 2–3 iterations |
| Works for any generation task | Self-critique can be overconfident or miss systematic biases |
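The cost side of the table can be made concrete with a little arithmetic. The sketch below assumes one generate call, one reflect call per round, and one revise call between consecutive rounds, with the run ending right after a reflect step. The function name is hypothetical.

```python
def llm_calls(reflect_rounds: int) -> int:
    """Total calls for one run: 1 generate + n reflects + (n - 1) revisions = 2n."""
    if reflect_rounds < 1:
        raise ValueError("a run needs at least one reflect round")
    generate = 1
    reflects = reflect_rounds
    revisions = reflect_rounds - 1  # no revision after the final reflect
    return generate + reflects + revisions
```

Under these assumptions a run that passes on the first critique costs 2 calls, twice a single prompt, and a run that uses a budget of 3 rounds costs 6.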

When to Use

  • High-stakes content generation (reports, code, analysis) where quality matters
  • When you can define clear quality criteria for the LLM to evaluate against
  • When external evaluation is unavailable or impractical
  • Tasks where iterative refinement naturally improves quality (writing, code review)
  • When you want an audit trail of improvements (critique chain)
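For the audit-trail use case, one simple option is to serialize the critique chain as JSON Lines, one record per iteration. The function name and record keys below are assumptions for illustration; any structured log format would serve.

```python
import json
from typing import Iterable, Tuple


def audit_trail_jsonl(iterations: Iterable[Tuple[str, str, bool]]) -> str:
    """Serialize (draft, critique, passed) tuples as one JSON object per line."""
    lines = []
    for index, (draft, critique, passed) in enumerate(iterations, start=1):
        record = {
            "iteration": index,
            "draft": draft,
            "critique": critique,
            "passed": passed,
        }
        lines.append(json.dumps(record))
    return "\n".join(lines)
```

Each line can be parsed independently, which keeps the trail easy to append to and to filter later.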

When NOT to Use

  • When first-pass quality is sufficient — reflection doubles the cost at minimum
  • When latency is critical — each iteration adds a full round-trip
  • When you have a reliable external evaluator — use Evaluator-Optimizer instead
  • For factual retrieval tasks — reflection can't fix missing knowledge; use RAG

Deeper Dive

  • Design — Critique prompt design, revision strategies, convergence detection, quality criteria
  • Implementation — Pseudocode, reflection prompts, iteration management, testing
  • Evolution — How reflection evolves from evaluator-optimizer