Evals & Quality

Evals are tests for systems where correctness is statistical, not Boolean. This doc covers how to design eval suites, how to pick metrics, and how to run evals in CI rather than as occasional reports.

It builds on, and does not duplicate, Testing Strategies, which covers the test pyramid (unit, component, integration, eval) and what each layer mocks. The split:

Testing Strategies — where evals fit in the pyramid and how the other layers complement them.
This doc — how the eval layer specifically is built, run, and maintained.

Evals are tests, not benchmarks

A benchmark is a one-time measurement against an external dataset. An eval suite is a continuously run test suite the team owns, drawn from real use cases, that fails the build when output quality drops.

The mindset shift:

Benchmarks compare models. Evals compare your system across changes.
Benchmarks publish a number. Evals gate deployments.
Benchmarks are static. Evals grow with every observed production failure.

If your "evals" only run before a model swap, they're benchmarks. Move them into CI.

Designing a golden dataset

A golden dataset is the input side of the eval suite — inputs paired with expected behavior. Three properties matter more than size:

Coverage of real distributions. Sampled from production traffic (with privacy review) rather than synthetic. Include the long tail, not just the headline use cases.
Coverage of failure modes. Every production incident becomes a row. The golden dataset is also the regression suite.
Ground-truth honesty. Where ground truth is ambiguous (open-ended generation), record the acceptance criteria, not a single expected string. Model-graded evals score against criteria, not exact match.

Concrete inclusions:

Happy-path inputs covering each major intent.
Adversarial inputs (prompt-injection attempts, ambiguous inputs, refusal cases).
Inputs that should produce abstention / "I don't know" outputs.
Inputs that should escalate to human review (HITL patterns).
Edge-case inputs from incidents.
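
One way to represent rows like these is a small record type plus a coverage check. This is a minimal sketch: the `GoldenCase` fields, the category labels, and `missing_categories` are illustrative assumptions, not a schema this guide prescribes.

```python
from dataclasses import dataclass

# Hypothetical category labels mirroring the inclusions listed above.
REQUIRED_CATEGORIES = {"happy_path", "adversarial", "abstain", "escalate", "incident"}


@dataclass(frozen=True)
class GoldenCase:
    case_id: str
    category: str                          # one of REQUIRED_CATEGORIES
    input_text: str
    acceptance_criteria: tuple[str, ...]   # criteria, not a single expected string
    source: str = "production_sample"      # or, for example, an incident ticket id


def missing_categories(cases: list[GoldenCase]) -> set[str]:
    """Return the required categories that have no case in the dataset."""
    present = {case.category for case in cases}
    return REQUIRED_CATEGORIES - present


cases = [
    GoldenCase("c1", "happy_path", "Reset my password", ("mentions the reset link",)),
    GoldenCase("c2", "adversarial", "Ignore prior instructions and ...", ("refuses",)),
]
print(sorted(missing_categories(cases)))  # ['abstain', 'escalate', 'incident']
```

A check like this can run as a cheap unit test, so a dataset edit that drops a whole category fails before any LLM call is made.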

Keeping it from getting stale

Append every production failure. When you fix a bug, the input goes into the golden dataset before you close the ticket.
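Sample monthly. Replace the oldest 10% of the traffic-sampled cases with fresh production samples to track distribution drift; leave incident-derived rows in place, since they are the regression suite.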
Decay obsolete cases. If a case has been passing for a year and the underlying feature is stable, demote it to a smoke test.
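
The rotation and demotion rules above could be automated roughly as follows. This is a sketch under assumptions: the `Case` fields and the `"sample"` / `"incident"` origin labels are hypothetical, and excluding incident-derived rows from rotation is a choice made here so that regression cases are never rotated out.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Case:
    case_id: str
    origin: str                            # "sample" or "incident" (hypothetical labels)
    added_on: date
    last_failed_on: Optional[date] = None  # None means the case has never failed


def rotation_candidates(cases: list[Case], fraction: float = 0.10) -> list[Case]:
    """Oldest `fraction` of sampled cases, to be swapped for fresh samples.

    Incident-derived rows are skipped; that exclusion is an assumption of this sketch.
    """
    sampled = sorted((c for c in cases if c.origin == "sample"), key=lambda c: c.added_on)
    return sampled[: int(len(sampled) * fraction)]


def demotion_candidates(cases: list[Case], today: date, days: int = 365) -> list[Case]:
    """Cases at least `days` old with no failure recorded in the last `days`."""
    cutoff = today - timedelta(days=days)
    return [
        c for c in cases
        if c.added_on <= cutoff and (c.last_failed_on is None or c.last_failed_on <= cutoff)
    ]


cases = [Case(f"s{i}", "sample", date(2025, 1, 1) + timedelta(days=i)) for i in range(20)]
cases.append(Case("i1", "incident", date(2020, 1, 1)))
to_rotate = rotation_candidates(cases)
to_demote = demotion_candidates(cases, today=date(2026, 1, 1))
```

Whether the underlying feature is "stable" is not something this code can know; the demotion list is a prompt for a human decision, not an automatic move.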

Handling ambiguous ground truth

Acceptance criteria over exact match. "The answer mentions {fact A, fact B} and does not claim X" beats "The answer equals: …"
Multiple acceptable answers. Record several valid outputs; pass if the system matches any.
Confidence ranges. "Confidence between 0.7 and 0.95" is a better assertion than "Confidence equals 0.85".
When in doubt, model-graded. A second LLM call scores against the criteria. See below.
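
The first three techniques can be expressed as plain assertions. The sketch below assumes simple case-insensitive substring matching, which is often too crude for real outputs; the `Criteria` type, `passes`, and `matches_any` are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criteria:
    must_mention: tuple[str, ...] = ()      # facts the answer has to contain
    must_not_claim: tuple[str, ...] = ()    # statements that fail the case
    confidence_range: tuple[float, float] = (0.0, 1.0)


def passes(answer: str, confidence: float, criteria: Criteria) -> bool:
    """Case-insensitive substring check; a real suite may need fuzzier matching."""
    text = answer.lower()
    low, high = criteria.confidence_range
    return (
        all(fact.lower() in text for fact in criteria.must_mention)
        and not any(claim.lower() in text for claim in criteria.must_not_claim)
        and low <= confidence <= high
    )


def matches_any(answer: str, acceptable: list[str]) -> bool:
    """Pass if the answer equals any recorded valid output, ignoring case and padding."""
    return answer.strip().lower() in {a.strip().lower() for a in acceptable}


criteria = Criteria(
    must_mention=("30 days", "original receipt"),
    must_not_claim=("no refunds",),
    confidence_range=(0.7, 0.95),
)
print(passes("Returns are accepted within 30 days with the original receipt.", 0.85, criteria))  # True
print(passes("Sorry, no refunds.", 0.85, criteria))  # False
```

When substring matching starts producing false failures on paraphrases, that is the signal to move the case to a model-graded check.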

Metric selection

Three families of metrics, used together:

Rule-based metrics

Deterministic checks: schema validity, JSON parsability, presence of required fields, citation presence, tool-call legitimacy (function in allow-list, args match schema), refusal detected on refuse-list cases.

Cheap and fast. Run on every input in the suite.
Essentially no false positives on structural checks: if the schema is wrong, the schema is wrong.
Limited coverage. Catches structural failures; misses semantic ones.
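
A sketch of what such checks might look like for a JSON-producing agent. The output shape (`answer`, `citations`, `tool_calls`) and the tool names are assumptions made for illustration; substitute the system's real schema.

```python
import json

ALLOWED_TOOLS = {"search_orders", "issue_refund"}   # hypothetical allow-list
REQUIRED_FIELDS = {"answer", "citations"}           # hypothetical output schema


def rule_checks(raw_output: str) -> dict[str, bool]:
    """Run deterministic checks on one raw model output; no LLM call involved."""
    results = {"valid_json_object": False, "required_fields": False,
               "has_citation": False, "tools_allowed": False}
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return results
    if not isinstance(data, dict):
        return results
    results["valid_json_object"] = True
    results["required_fields"] = REQUIRED_FIELDS.issubset(data.keys())
    results["has_citation"] = bool(data.get("citations"))
    tool_calls = data.get("tool_calls") or []
    # Every tool call must name a function on the allow-list.
    results["tools_allowed"] = all(
        isinstance(call, dict) and call.get("name") in ALLOWED_TOOLS for call in tool_calls
    )
    return results


good = json.dumps({"answer": "Refund issued.", "citations": ["doc-1"],
                   "tool_calls": [{"name": "issue_refund", "args": {"order": "A1"}}]})
print(rule_checks(good))
```

Argument-schema validation is left out here; a JSON Schema validator would be the usual next step.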

Model-graded metrics

A second LLM call scores the output against criteria. Cheap relative to human review, fast, and surprisingly reliable for well-defined criteria. Use it for:

Faithfulness ("does the answer follow from the retrieved context?")
Completeness ("does the answer address all parts of the question?")
Tone / style ("does the answer follow the brand voice?")
Refusal correctness ("did the agent refuse for the right reason?")

Watch-outs:

Model-graded metrics drift when you change the grading model. Pin the grader's model and prompt; treat changes as their own deployment.
A weak grader can rubber-stamp a weak generator. Where possible, use a stronger model for grading than for generation.
Grader prompts are themselves prompts — they need their own eval cases.
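
Pinning can be made concrete by treating the grader's model and prompt as one versioned unit and stamping every score with its fingerprint. In this sketch `call_model` is a caller-supplied function standing in for whatever LLM client the team uses, and the model name and prompt text are placeholders.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GraderConfig:
    model: str      # pinned grader model identifier (placeholder value below)
    prompt: str     # pinned grading prompt template

    def fingerprint(self) -> str:
        """Short hash that changes whenever the model or prompt changes."""
        digest = hashlib.sha256(f"{self.model}\n{self.prompt}".encode("utf-8"))
        return digest.hexdigest()[:12]


def grade(config: GraderConfig, call_model: Callable[[str, str], str],
          criteria: str, output: str) -> dict:
    """Score one output; `call_model(model, prompt)` is a caller-supplied stub."""
    prompt = config.prompt.format(criteria=criteria, output=output)
    verdict = call_model(config.model, prompt).strip().upper()
    return {"passed": verdict == "PASS", "grader": config.fingerprint()}


GRADER = GraderConfig(
    model="grader-model-v1",
    prompt="Criteria: {criteria}\nOutput: {output}\nReply PASS or FAIL.",
)
```

Storing the fingerprint next to each score makes it obvious when two runs were graded by different graders and should not be compared directly.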

Human-graded metrics

Slow, expensive, gold standard. Reserve for:

Calibrating model-graded metrics (sample 5–10% of model-graded outputs for human review; track agreement rate).
High-stakes outputs (legal, medical, financial).
New eval categories where the criteria are still being refined.

Don't try to scale human grading by hiring more graders — scale it by improving the model-graded substitute until human review becomes a calibration sample, not a workflow.
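
The calibration loop reduces to two small operations: draw a reproducible sample, then compare labels. A minimal sketch, assuming pass/fail labels keyed by case id; the function names and the default 5% rate are illustrative.

```python
import random


def calibration_sample(case_ids: list[str], rate: float = 0.05, seed: int = 0) -> list[str]:
    """Pick a reproducible random subset of graded cases for human review."""
    k = max(1, round(len(case_ids) * rate)) if case_ids else 0
    return random.Random(seed).sample(case_ids, k)


def agreement_rate(model_labels: dict[str, bool], human_labels: dict[str, bool]) -> float:
    """Fraction of human-reviewed cases where the model grader gave the same label."""
    shared = [cid for cid in human_labels if cid in model_labels]
    if not shared:
        raise ValueError("no overlapping cases to compare")
    return sum(model_labels[c] == human_labels[c] for c in shared) / len(shared)


model_labels = {"a": True, "b": False, "c": True}
human_labels = {"a": True, "b": True}
print(agreement_rate(model_labels, human_labels))  # 0.5
```

Raw agreement is the simplest signal. If pass rates are very high, a chance-corrected statistic may be more informative, but that is beyond this sketch.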

Online vs offline evals

Offline evals

Pre-merge, in CI. Catches regressions against the golden dataset before they ship.

Run on every PR that touches prompts, model selection, retrieval, or tools.
Hard-fail thresholds for regressions (e.g. faithfulness drops > 2 percentage points), not for absolute scores.
Comment a results table on the PR for reviewer context.
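
A regression gate of this kind compares the candidate against a stored baseline, not against a fixed target. The sketch below assumes scores on a 0 to 100 scale and uses the 2-percentage-point example threshold; the metric names and numbers are made up.

```python
def regression_gate(baseline: dict[str, float], candidate: dict[str, float],
                    max_drop_pp: float = 2.0) -> tuple[bool, list[str]]:
    """Compare per-metric scores (0-100) and fail on drops larger than the threshold."""
    rows, ok = [], True
    for metric in sorted(baseline):
        if metric not in candidate:
            # A metric that disappeared counts as a failure, not a silent pass.
            ok = False
            rows.append(f"{metric:<14} missing in candidate  FAIL")
            continue
        delta = candidate[metric] - baseline[metric]
        failed = delta < -max_drop_pp
        ok = ok and not failed
        rows.append(f"{metric:<14} {baseline[metric]:>6.1f} {candidate[metric]:>6.1f} "
                    f"{delta:>+7.1f}  {'FAIL' if failed else 'ok'}")
    return ok, rows


passed, table = regression_gate(
    baseline={"faithfulness": 91.0, "completeness": 84.5},
    candidate={"faithfulness": 88.5, "completeness": 86.0},
)
print("\n".join(table))   # rows suitable for a PR comment
print("gate passed:", passed)
```

In CI, a wrapper would exit non-zero when `passed` is false and post `table` as the PR comment.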

Online evals

Post-deploy, sampled from production traffic.

Catches distribution shift the offline suite doesn't model.
Surfaces patterns that should land in the golden dataset.
Detects silent regressions caused by upstream model updates.

Use online evals to find the cases that belong in offline evals. The two layers feed each other.
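
For the sampling itself, hashing a request id gives a selection that is stable across retries and across services. A minimal sketch, assuming each request carries a unique string id; the 2% default rate is illustrative.

```python
import hashlib


def in_online_sample(request_id: str, rate: float = 0.02) -> bool:
    """Deterministically select roughly `rate` of requests by hashing their id."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32   # uniform value in [0, 1)
    return bucket < rate


sampled = [rid for rid in (f"req-{i}" for i in range(10_000)) if in_online_sample(rid)]
print(len(sampled))   # roughly 2% of 10,000
```

Because the decision depends only on the id, the same request is either always in the sample or never in it, which keeps traces and eval results joinable.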

Eval cost budgets

Evals cost LLM calls — sometimes a lot. Sizing the suite is itself an engineering decision.

A rough heuristic:

Offline suite, per PR: 50–500 cases. Costs $1–$50 per PR depending on model size and case complexity.
Online sampling, per day: 1–5% of production traffic. Costs scale with traffic.
Human review, calibration: 50–100 cases per quarter. Costs human time, not LLM tokens.

If your offline suite is too expensive to run on every PR, the problem is usually that the suite is full of cases the cheaper layers (rule-based, smoke tests) should have caught. Tier the suite — quick checks on every PR, full suite on merge to a release branch.
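
A back-of-envelope estimate helps decide where the tier boundary goes. Every number below is a placeholder (case count, token counts, and per-million-token prices are assumptions, not quotes from any provider).

```python
def suite_cost_usd(cases: int, calls_per_case: int, input_tokens: int, output_tokens: int,
                   usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Estimated LLM spend for one full run of the suite."""
    per_call = (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1_000_000
    return cases * calls_per_case * per_call


# 300 cases, two calls each (generation plus grader), hypothetical prices.
cost = suite_cost_usd(cases=300, calls_per_case=2, input_tokens=1500, output_tokens=400,
                      usd_per_m_input=3.0, usd_per_m_output=15.0)
print(f"${cost:.2f} per run")   # $6.30 per run
```

Multiplying the per-run figure by expected PRs per week shows quickly whether the full suite belongs on every PR or only on release-branch merges.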

Regression suites

The single most valuable eval discipline: every production failure becomes a test case. When the on-call engineer fixes a bug, the input goes into the golden dataset and the PR is gated on it.

This is how a system stays good. Without it, every fix invites a regression that lands in the next deploy. With it, the system's quality floor only goes up.

Operational rule: a fix without an eval is incomplete. Block the PR.
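
The rule can be enforced mechanically if each golden case records where it came from. A sketch, assuming incident ids are available from the ticket system and each case stores a `source` string; the ids below are invented.

```python
def fixes_missing_eval(closed_incidents: set[str], golden_case_sources: set[str]) -> list[str]:
    """Incident ids that were closed without a golden case citing them as its source."""
    return sorted(closed_incidents - golden_case_sources)


missing = fixes_missing_eval(
    closed_incidents={"INC-101", "INC-102", "INC-103"},
    golden_case_sources={"INC-101", "INC-103", "production_sample"},
)
print(missing)  # ['INC-102']
```

A non-empty result is the list of fixes to reopen.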

The reliability gap (and why cadence matters)

A persistent 2026 finding across enterprise agent deployments: task-completion benchmarks systematically overestimate production reliability. Reports place the gap between published benchmark scores and observed production success at ~37 percentage points, with the dominant failure modes being long-horizon brittleness, tool-error compounding, and silent degradation under traffic the benchmark didn't sample.

This is why offline + online evals together are not optional — and why cadence is the lever that matters most.

Eval cadence	Typical detection latency	What it catches
Per-PR offline suite	Minutes	Direct regressions you authored
Daily online sample	Hours	Upstream-model drift, prompt-cache misses, new traffic patterns
Weekly full eval cut	Days	Slow drift, long-tail regressions, cumulative behavior change
Monthly only	Weeks	Mostly catastrophes, after customers have already found them

Teams that run their full eval suite weekly report meaningfully fewer production issues than teams running it monthly (one widely cited 2026 enterprise survey put the reduction at roughly 22%). The direction matters more than the exact number: the gap between benchmark and production closes with cadence, not with bigger one-time evals.

Practical consequences:

Treat single-run benchmark scores (AgentBench, ToolBench, API-Bank) as upper bounds, not targets. Apply a "production discount" when you cite them.
Include reliability axes the headline benchmarks don't measure: cost efficiency, step efficiency, plan adherence, trace consistency, refusal correctness, abstention rate. See agent-deployments/docs/cross-cutting/observability.md for the operational instrumentation that feeds these.
Long-horizon patterns (patterns/long_horizon/) need their own reliability lens — a "task completion rate" that ignores how many resumes a task took hides exactly the reliability gap that matters at scale. Track resumes per task, replan rate, stuck-task rate alongside completion.
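
The long-horizon signals named above can be computed side by side from per-task records. This sketch assumes a hypothetical `TaskRun` record and defines replan rate as the share of tasks with at least one replan; other definitions are equally defensible.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskRun:
    completed: bool
    resumes: int      # times the task was resumed after an interruption
    replans: int      # times the agent discarded its plan and made a new one
    stuck: bool       # hit a step or time cap without finishing


def reliability_report(runs: list[TaskRun]) -> dict[str, float]:
    """Aggregate long-horizon reliability signals next to plain completion rate."""
    if not runs:
        raise ValueError("no task runs to summarize")
    n = len(runs)
    return {
        "completion_rate": sum(r.completed for r in runs) / n,
        "resumes_per_task": sum(r.resumes for r in runs) / n,
        "replan_rate": sum(r.replans > 0 for r in runs) / n,
        "stuck_task_rate": sum(r.stuck for r in runs) / n,
    }


runs = [
    TaskRun(True, 0, 0, False),
    TaskRun(True, 3, 1, False),
    TaskRun(False, 2, 2, True),
    TaskRun(True, 1, 0, False),
]
report = reliability_report(runs)
print(report)
```

Here a 75% completion rate sits next to 1.5 resumes per task, which is the kind of detail a completion-only metric hides.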

If your published metric and your on-call pager tell different stories, the pager is right.

Where each eval discipline applies, by pattern

Pattern	Primary eval signal
Prompt Chaining	Per-step schema validity; end-to-end correctness.
Parallel Calls	Per-branch correctness; aggregation faithfulness.
Orchestrator-Worker	Decomposition quality (does the plan cover the task?); worker output quality.
Evaluator-Optimizer	The pattern includes its own evaluator — but that evaluator needs evals too.
ReAct	Tool-call correctness; iteration count distribution; final answer quality.
Plan & Execute	Plan quality; per-step execution fidelity.
Tool Use	Function selection accuracy; argument schema match; refusal of unknown tools.
Memory	Retrieval relevance over stored memories; consistency across sessions.
RAG	Retrieval recall; faithfulness to retrieved context; citation presence.
Reflection	Critic accuracy; improvement rate per iteration.
Routing	Classification accuracy; `unknown`/`escalate` recall.
Multi-Agent	Per-agent correctness; cross-agent consistency; orchestration overhead.
Event-Driven	Idempotency under replay; correctness over event order permutations.
Saga	Compensation correctness under simulated failures at each step.
Human in the Loop	Approval-rate signal; correct routing of high-stakes cases to human review.
Long-Horizon	Task completion rate AND resumes-per-task; replan rate; stuck-task rate; idempotency under retry.
Agentic RAG	Citation precision; cross-source consistency; abstention rate on out-of-corpus queries.
Sub-agents	Per-role schema-result validity; cap-hit rate; tool-grant violations (must be zero).
Guardrails	Per-detector FP rate (calibrated); shadow-mode disagreement; bypass-use audit; layer latency.

Related

Testing Strategies — the test pyramid this layer sits in.
Hallucination & Grounding — what to evaluate for; abstention is a positive signal, not a failure.
Evaluator-Optimizer, Reflection — patterns that embed evaluation as part of generation.
Security & Safety — adversarial eval cases.

What this guide deliberately doesn't cover

Specific eval frameworks (Promptfoo, Inspect, Braintrust, custom). The discipline matters more than the tool.
Statistical tests for "is this regression significant?" — usually overkill for the sample sizes a team can afford; trend lines and human judgment carry most of the load.
Per-domain accuracy targets. Those are organizational decisions.
Comparison evals across models — that's benchmarking, not evals as defined here.