Cost & Model Selection

Pattern selection determines what shape of cost you commit to. Model selection determines how much that shape costs per call. This doc covers both — the model-tier decision (Haiku / Sonnet / Opus / external), the per-pattern cost shape, and the guardrails that keep the bill bounded.

Per-pattern cost specifics still live in each pattern's cost-and-latency.md. This is the cross-cutting frame those documents specialize.

Model tier selection

Most agent stacks have access to three tiers from one provider plus selected external models. The right model for a step is rarely the most capable one available — it's the smallest one that does the job reliably enough.

Decision tree

  • Classification, routing, structured extraction, tool selection from a short list, refusal detection → fastest, cheapest tier (Haiku-class). These are constrained tasks where the model picks among a small enumeration; large models add latency, not accuracy.
  • General-purpose generation, summarization, multi-step reasoning, code generation for common tasks → mid tier (Sonnet-class). Default for most agent loops, including the loop itself in ReAct and Plan & Execute.
  • Long-horizon planning, deep multi-step reasoning, code generation for novel problems, eval grading where you need the grader to outperform the generator → highest tier (Opus-class). Reach for this when the failure mode of the mid tier is "the answer is plausible but wrong in subtle ways."
  • Specialized tasks → external models. Embedding generation, classification for safety / content moderation, OCR, speech, vision-only tasks. Use the right tool, not the universal one.

Mixed-tier within one agent

A well-designed agent uses multiple tiers. Examples:

  • ReAct loop: Sonnet for reasoning, Haiku for tool-output formatting, Opus for the final summary.
  • Multi-agent system: Sonnet workers, Opus supervisor, Haiku router.
  • Eval pipeline: Sonnet generator + Opus grader (the grader must outclass the generator to be useful).

Don't homogenize unless the cost difference is negligible. Per-call differences compound over loops.

When to reach for an external model

  • The task is OOD for the foundation models (specialized domain, non-English, structured-data-specific).
  • The economics flip — small specialized models can be 10× cheaper for the right task.
  • The data can't leave your perimeter, and you have the infra to host.

External models add operational overhead (deployment, fine-tuning pipelines, drift management). Justify the switch with a measured baseline, not a hunch.

Token budgets

A token budget is a structural cap, not a soft suggestion. Three layers:

  • Per-call budget. Each LLM call has a max-tokens cap on input and output. Models have hard limits; your budget should be tighter to fail fast on prompt explosions.
  • Per-session / per-request budget. Sum of tokens across all calls in one user-facing transaction. Catches runaway loops.
  • Per-day budget. Aggregate across the deployment, per cost center. Catches abuse and runaway batch jobs.

When a budget trips, options:

  • Fail the request and return a meaningful error.
  • Downgrade gracefully (truncate context, switch to a cheaper model, use a cached answer).
  • Escalate to a human queue.

The right choice depends on the use case. The wrong choice is no behavior — uncapped cost is a denial-of-wallet vulnerability. See Security & Safety.

Per-pattern cost shape

Patterns have characteristic cost profiles. Use this table to estimate before you build.

Pattern Cost shape Where cost concentrates
Prompt Chaining Linear in steps Total prompt size grows if each step accumulates context.
Parallel Calls N parallel + 1 aggregation Fan-out width is the lever.
Orchestrator-Worker 1 plan + N workers + 1 synth Plan and synth are expensive (full context); workers are cheaper.
Evaluator-Optimizer 2× per iteration; multiple iterations Cap iterations explicitly.
ReAct Variable per step; unbounded without cap Each step appends the prior observation; context grows.
Plan & Execute 1 plan + N steps + 0–1 replan Plan is expensive; per-step cost dominates over enough steps.
Tool Use 1 call per tool decision Tool execution cost (paid APIs) often exceeds LLM cost.
Memory Linear in retrieved memories Compression and summarization keep this bounded.
RAG Retrieval (cheap) + generation (medium) Generation cost scales with retrieved context size.
Reflection 2–N× per iteration Iteration cap is the lever.
Routing 1 classifier call + downstream pattern cost Classifier is cheap; downstream dominates.
Multi-Agent Sum of agent-call costs + orchestration overhead Supervisor + N workers; supervisor cost is per-handoff.
Event-Driven Per-event agent cost Event rate × per-event cost; rate-limit per partition.
Saga Steps + potential compensations Compensations amplify cost on failure.
Human in the Loop Low LLM cost; high human-time cost Cost shape changes; budget human review time.
Long-Horizon Per-tick model cost + storage over task lifetime Cost accumulates over days/weeks even when idle; recap + virtual FS are the levers.
Agentic RAG Per-question: decompose + (retrieve + score + reflect) × sub-questions + compose 3–10× baseline RAG cost; tier the loop with cheaper models for relevance scoring.
Sub-agents One agent loop per spawn; per-role model selection Often lower total cost than a monolithic agent thanks to per-role tiering.
Guardrails Per-layer detectors + optional quarantined-LLM call per untrusted tool result Paid per request, every request; dual-LLM is the dominant cost driver.

The latency / cost / quality triangle

Pick two. Most architectural decisions trade off one of these against the others:

  • Faster + cheaper → fewer steps, smaller model, no iteration (Prompt Chaining, Tool Use with Haiku).
  • Faster + higher-quality → larger model, no iteration, parallel calls (Parallel Calls with Opus).
  • Higher-quality + cheaper → smaller model with iteration (Evaluator-Optimizer with Sonnet, Reflection with Sonnet).

There's no path to fast + cheap + high-quality — claims to the contrary are almost always undisclosed tradeoffs (lower quality bar, cached answers, narrow domain).

When stakeholders push for all three: name the tradeoff explicitly and pick the right two.

Cost guardrails

Guardrails make cost predictable. The set that consistently works:

  • Iteration caps on every loop pattern. max_steps is the single most important agent parameter.
  • Token caps per call and per session.
  • Tool-call caps per agent instance per minute. Catches infinite-tool-loop bugs.
  • Daily aggregate caps with alerting at 50% / 80% / 100% of budget.
  • Per-tenant / per-user budgets for multi-tenant systems.
  • Per-environment budgets. Dev / staging should have hard caps below production — dev mistakes shouldn't bankrupt the team.

Alerting is not a cap. A cap stops cost; an alert tells you that cost is happening. You need both.

What's deferred to `agent-deployments`

The operational cost layer — provider-level rate limits, autoscaling policy, autoscaling triggers, cost dashboards, alerting integration — lives in `agent-deployments/docs/cross-cutting/observability.md` and `docs/capabilities/obs/`. This doc covers what to think about at the architecture layer; the deployment layer covers what to instrument.

What this guide deliberately doesn't cover

  • Specific dollar figures per model. Provider pricing changes; the patterns don't.
  • Provider-specific batch / cache / commitment pricing. Read the provider's docs.
  • Cost modeling spreadsheets — those are organizational artifacts, not architectural ones.
  • "Which provider is cheapest" — depends on workload shape and changes too fast to enshrine in docs.