Research Audio Weekly

How to Evaluate AI Agents

Anthropic's engineering team on evaluation design, grader types, and common failure modes

Anthropic's engineering team published a detailed guide on evaluating AI agents. The guide draws from internal work on Claude Code and collaborations with companies including Stripe, Shopify, Bolt, and Sierra.

This breakdown covers the core concepts, frameworks, and implementation details.

Why Evaluations Break Down at Scale

Teams building AI agents typically follow a predictable trajectory. In the early stages, manual testing, dogfooding, and intuition carry development forward. The team ships features, gathers feedback, and iterates quickly.

Then comes the breaking point.

Users start reporting that the agent "feels worse" after changes. The team has no way to verify this except through manual spot-checking. Debugging becomes entirely reactive: wait for complaints, reproduce manually, fix the bug, hope nothing else regressed. There's no way to distinguish real regressions from noise, no ability to automatically test changes against hundreds of scenarios before shipping, and no way to measure whether improvements actually improved anything.

The core insight: Without evaluations, fixing one failure often creates others. Teams get stuck in reactive loops, catching issues only in production where the cost of mistakes is highest.

Evaluations break this cycle by making problems and behavioral changes visible before they affect users. Their value compounds over the lifecycle of an agent—early investment pays dividends at every subsequent stage.

Single-Turn vs Agent Evaluations

Traditional LLM evaluations follow a simple pattern: send a prompt, receive a response, apply grading logic. This worked well for earlier language models focused on single-turn interactions.

Agent evaluations are fundamentally different. Agents operate over many turns, calling tools, modifying state, and adapting based on intermediate results. These same capabilities that make agents useful—autonomy, intelligence, flexibility—also make them dramatically harder to evaluate.

Single-turn evaluations test one prompt-response pair. Agent evaluations must handle tools, environments, multi-step execution, and state changes.

A critical finding from Anthropic's work: frontier models can find creative solutions that surpass static evaluations. In one case, Opus 4.5 "failed" a τ2-bench flight booking problem by discovering a policy loophole—technically failing the evaluation as written, but actually producing a better solution for the user.

This reveals a fundamental tension in agent evaluation: you want graders strict enough to catch real problems but flexible enough to reward genuinely creative solutions.

Essential Evaluation Terminology

Before diving deeper, understanding these terms precisely matters for implementation:

  • Task: A single test with defined inputs and success criteria. Also called a problem or test case.
  • Trial: Each attempt at a task. Run multiple trials to account for non-deterministic model outputs.
  • Grader: Logic that scores some aspect of performance. A task can have multiple graders, each with multiple assertions.
  • Transcript: Complete record of a trial: outputs, tool calls, reasoning, intermediate results. For the Anthropic API, this is the full messages array.
  • Outcome: Final state in the environment. The agent might say "Your flight is booked," but the outcome is whether a reservation exists in the database.
  • Evaluation harness: Infrastructure that runs evals end-to-end: provides instructions and tools, runs tasks concurrently, records steps, grades outputs, aggregates results.
  • Agent harness: The system enabling a model to act as an agent: it processes inputs, orchestrates tool calls, and returns results. When evaluating "an agent," you're evaluating the harness and model together.
  • Evaluation suite: Collection of tasks measuring specific capabilities. Tasks typically share a broad goal (e.g., a customer support suite testing refunds, cancellations, escalations).
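
To make these terms concrete, here is a minimal Python sketch of how a harness might represent tasks, trials, and graders. The field names are illustrative, not taken from Anthropic's guide.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    task_id: str
    instructions: str                     # input given to the agent
    graders: list["Grader"]               # success criteria for this task

@dataclass
class Trial:
    task_id: str
    transcript: list[dict]                # full record: outputs, tool calls, intermediate results
    outcome: dict                         # final environment state

@dataclass
class Grader:
    name: str
    grade: Callable[["Trial"], float]     # returns a score in [0, 1]

def run_suite(tasks, run_trial, trials_per_task=3):
    """Run every task several times (multiple trials account for non-determinism)."""
    results = {}
    for task in tasks:
        scores = []
        for _ in range(trials_per_task):
            trial = run_trial(task)                                # one attempt at the task
            scores.append(min(g.grade(trial) for g in task.graders))
        results[task.task_id] = sum(scores) / len(scores)
    return results
```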

The Three Types of Graders

Effective agent evaluations combine three grader types. Each evaluates either the transcript (what happened) or the outcome (final state). Choosing the right graders for each task is one of the most important design decisions in building evaluation systems.

Anthropic's approach: Use deterministic graders where possible, LLM graders where necessary for flexibility, and human graders for validation and calibration.

A common instinct is to check that agents followed specific steps—a particular sequence of tool calls in the right order. Anthropic found this approach too rigid, resulting in brittle tests. Agents regularly find valid approaches that evaluation designers didn't anticipate. Grade what the agent produced, not the path it took.

For tasks with multiple components, build in partial credit. A support agent that correctly identifies a problem and verifies the customer but fails to process a refund is meaningfully better than one that fails immediately. Represent this continuum in your scoring.
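
As a rough illustration of both points, here is a minimal sketch of an outcome-focused grader with partial credit for the support-refund example above; the state fields are hypothetical.

```python
def grade_refund_task(outcome: dict) -> float:
    """Grade what the agent produced (final state), not the path it took."""
    score = 0.0
    if outcome.get("problem_identified"):
        score += 0.3          # partial credit: diagnosed the issue
    if outcome.get("customer_verified"):
        score += 0.3          # partial credit: completed identity verification
    if outcome.get("refund_recorded_in_db"):
        score += 0.4          # the outcome that actually matters
    return score

# An agent that diagnoses and verifies but never issues the refund scores 0.6
# rather than 0.0, reflecting the continuum described above.
print(grade_refund_task({"problem_identified": True, "customer_verified": True}))
```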

Capability vs Regression Evaluations

These two evaluation types serve fundamentally different purposes and should be designed accordingly:

Capability evaluations ask "what can this agent do well?" They should start at a low pass rate, targeting tasks the agent currently struggles with. This gives teams a clear hill to climb. As you improve the agent, capability eval scores should increase.

Regression evaluations ask "does the agent still handle tasks it used to?" These should maintain a nearly 100% pass rate. A decline signals something is broken. Regression evals protect against backsliding as you hill-climb on capability evals.

The natural lifecycle: after an agent is optimized, capability evals with high pass rates "graduate" to become regression suites. Tasks that once measured "can we do this at all?" then measure "can we still do this reliably?"

Evaluation Strategies by Agent Type

Coding Agents

Coding agents write, test, and debug code, navigate codebases, and run commands. Deterministic graders are natural here: does the code run? Do tests pass?

Two widely-used benchmarks demonstrate the approach. SWE-bench Verified gives agents GitHub issues from popular Python repositories and grades solutions by running the test suite: a solution passes only if it fixes the failing tests without breaking existing ones. LLMs progressed from 40% to over 80% on this benchmark in one year. Terminal-Bench tests end-to-end technical tasks like building a Linux kernel from source.

Beyond pass/fail tests, consider grading the transcript: heuristics-based code quality rules, model-based graders for behaviors like tool usage patterns or user interaction quality.
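
As a rough sketch of what a deterministic outcome grader can look like for a coding agent, the snippet below runs pinned test sets in a sandboxed checkout and passes only if previously failing tests now pass and existing tests still do. The pytest invocation and directory layout are assumptions, not any benchmark's actual harness.

```python
import subprocess

def grade_code_outcome(repo_dir: str, fail_to_pass: list[str], pass_to_pass: list[str]) -> bool:
    """Pass only if the agent's patch fixes the failing tests without breaking existing ones."""
    def tests_pass(test_ids: list[str]) -> bool:
        result = subprocess.run(
            ["python", "-m", "pytest", "-q", *test_ids],  # assumed test runner
            cwd=repo_dir,                                 # sandboxed checkout with the patch applied
            capture_output=True,
            timeout=600,
        )
        return result.returncode == 0

    return tests_pass(fail_to_pass) and tests_pass(pass_to_pass)
```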

Conversational Agents

Conversational agents interact in domains like support, sales, or coaching. Unlike with coding agents, the quality of the interaction itself is part of what you're evaluating, and these evals often require a second LLM to simulate the user.

Success is multidimensional: Is the ticket resolved (state check)? Did it finish in under 10 turns (transcript constraint)? Was the tone appropriate (LLM rubric)? Benchmarks like τ-Bench and τ2-Bench simulate multi-turn interactions across domains, where one model plays a user persona while the agent navigates realistic scenarios.
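
Below is a minimal sketch of a simulated-user loop in that spirit. The `agent_reply` and `simulated_user_reply` functions are placeholders for real model calls, and the "[DONE]" stop signal is an assumption.

```python
def agent_reply(history: list[dict]) -> str:
    # Placeholder: call your agent harness / model here.
    return "I've cancelled the booking and emailed a confirmation."

def simulated_user_reply(history: list[dict], persona: str) -> str:
    # Placeholder: a second LLM playing the user persona; "[DONE]" ends the dialogue.
    return "[DONE]"

def run_conversation(persona: str, opening: str, max_turns: int = 10) -> list[dict]:
    history = [{"role": "user", "content": opening}]
    for _ in range(max_turns):
        history.append({"role": "assistant", "content": agent_reply(history)})
        user_msg = simulated_user_reply(history, persona)
        if user_msg == "[DONE]":
            break
        history.append({"role": "user", "content": user_msg})
    return history

transcript = run_conversation(
    persona="frustrated customer who was double-charged",
    opening="I was charged twice for my order and I want this fixed today.",
)
# Graders then combine dimensions: a state check (is the ticket resolved?),
# a transcript constraint (finished in under 10 turns?), and an LLM rubric for tone.
```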

Research Agents

Research agents gather, synthesize, and analyze information. These are the hardest to evaluate—experts may disagree on what counts as "comprehensive" or "well-sourced."

Combine grader types: groundedness checks verify claims are supported by retrieved sources, coverage checks define key facts a good answer must include, source quality checks confirm authoritative sources were consulted. For objectively correct answers, exact match works. LLM judges can flag unsupported claims and verify synthesis coherence.

Given the subjective nature of research quality, LLM-based rubrics should be frequently calibrated against expert human judgment.
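
A rough sketch of how these checks might be combined is shown below; the domain allowlist, the substring-based fact matching, and the `judge_supported` placeholder are all illustrative.

```python
AUTHORITATIVE_DOMAINS = {"nature.com", "arxiv.org", "who.int"}   # example allowlist

def coverage_score(answer: str, required_facts: list[str]) -> float:
    """Fraction of key facts a good answer must include."""
    return sum(fact.lower() in answer.lower() for fact in required_facts) / max(len(required_facts), 1)

def source_quality_score(cited_urls: list[str]) -> float:
    """Fraction of citations that come from an authoritative source."""
    if not cited_urls:
        return 0.0
    return sum(any(d in url for d in AUTHORITATIVE_DOMAINS) for url in cited_urls) / len(cited_urls)

def judge_supported(claim: str, sources: list[str]) -> bool:
    # Placeholder for an LLM groundedness judge: is the claim supported by the
    # retrieved sources? Calibrate this judge against expert human labels.
    return True

def groundedness_score(claims: list[str], sources: list[str]) -> float:
    return sum(judge_supported(c, sources) for c in claims) / max(len(claims), 1)
```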

Computer Use Agents

Computer use agents interact through screenshots, mouse clicks, keyboard input, and scrolling—the same interface as humans. Evaluation requires running the agent in a real or sandboxed environment.

WebArena tests browser-based tasks using URL and page state checks, plus backend state verification for tasks that modify data (confirming an order was placed, not just that a confirmation page appeared). OSWorld extends this to full operating system control.
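
Here is a minimal sketch of that kind of backend state check, assuming a SQLite application database with a hypothetical `orders` table.

```python
import sqlite3

def order_was_placed(db_path: str, user_id: int, sku: str) -> bool:
    """Grade the outcome by querying the backend, not by checking the confirmation page."""
    with sqlite3.connect(db_path) as conn:
        row = conn.execute(
            "SELECT COUNT(*) FROM orders WHERE user_id = ? AND sku = ? AND status = 'confirmed'",
            (user_id, sku),
        ).fetchone()
    return row[0] > 0
```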

Browser-use agents must also balance token efficiency against latency. The Claude for Chrome team built evals to check whether the agent was selecting the right interaction method for each context, helping it complete tasks faster and more accurately.

Handling Non-Determinism: pass@k vs pass^k

Agent behavior varies between runs: a task that passed on one run might fail on the next. Sometimes what matters is whether the agent can succeed at all; other times, whether it succeeds every time. Two metrics capture this distinction:

pass@k measures the likelihood of at least one correct solution in k attempts. As k increases, pass@k rises—more shots on goal means higher odds of at least one success. A 50% pass@1 score means the model succeeds at half the tasks on its first try.

pass^k measures the probability that ALL k trials succeed. As k increases, pass^k falls—demanding consistency across more trials is harder. With a 75% per-trial success rate and 3 trials, the probability of passing all three is (0.75)³ ≈ 42%.

At k=1, both metrics are identical (both equal the per-trial success rate). By k=10, they tell opposite stories: pass@k approaches 100% while pass^k falls toward 0%.

Which to use depends on product requirements: pass@k for tools where finding one good solution matters, pass^k for customer-facing agents where users expect reliable behavior every single time.
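
Under the simplifying assumption of independent trials with per-trial success rate p, both metrics reduce to short formulas, sketched below.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent trials succeeds."""
    return 1 - (1 - p) ** k

def pass_hat_k(p: float, k: int) -> float:
    """Probability that all k independent trials succeed (written pass^k)."""
    return p ** k

p = 0.75
print(round(pass_at_k(p, 3), 2))   # 0.98: more shots on goal
print(round(pass_hat_k(p, 3), 2))  # 0.42: matches the (0.75)^3 example above
```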

8 Steps to Building Useful Evals

This roadmap reflects Anthropic's approach to going from zero evaluations to a working system.

Step 0: Start Early

Teams delay building evals, thinking they need hundreds of tasks. In reality, 20-50 simple tasks drawn from real failures are enough to begin. In early development each change has a noticeable impact, and large effect sizes mean small sample sizes suffice. Evals only get harder to build the longer you wait; early on, product requirements translate naturally into test cases.

Step 1: Start with Manual Checks

Begin with the behaviors you already verify before each release and the common tasks end users attempt. If you're already in production, mine your bug tracker and support queue: converting user-reported failures into test cases ensures your suite reflects actual usage patterns. Prioritize by user impact.

Step 2: Write Unambiguous Tasks

A good task is one where two domain experts would independently reach the same pass/fail verdict. Could they pass the task themselves? If not, refine it. Ambiguity in task specifications becomes noise in metrics. Create a reference solution for each task: a known-working output that passes all graders, proving the task is solvable and verifying graders are correctly configured.
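
As an illustration, a task specification with its reference solution might look like the sketch below (the full guide uses YAML configurations; every field name here is hypothetical).

```python
task = {
    "id": "refund-duplicate-charge-001",
    "instructions": "A customer was charged twice for order #4821. Refund the duplicate charge.",
    "graders": [
        {"type": "state", "check": "exactly one refund of $49.99 exists for order 4821"},
        {"type": "transcript", "check": "customer identity verified before refund issued"},
    ],
    "reference_solution": {
        # A known-working outcome that should pass all graders, proving the task
        # is solvable and the graders are configured correctly.
        "refunds": [{"order_id": 4821, "amount": 49.99}],
        "identity_verified": True,
    },
}

def validate_task(task: dict, run_graders) -> bool:
    """Fail fast if the reference solution does not pass its own graders."""
    return run_graders(task["graders"], task["reference_solution"])
```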

Step 3: Build Balanced Problem Sets

Test both cases where a behavior should occur AND where it shouldn't. One-sided evals create one-sided optimization. If you only test whether the agent searches when it should, you might end up with an agent that searches for everything. Avoid class-imbalanced evaluation suites.
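
A minimal sketch of a balanced pair for a search behavior, with illustrative prompts and a hypothetical `web_search` tool name:

```python
search_tasks = [
    # The agent SHOULD search: the answer depends on fresh external information.
    {"prompt": "What did the company announce at yesterday's keynote?", "should_search": True},
    # The agent should NOT search: the answer is self-contained in the prompt.
    {"prompt": "Summarize the three bullet points I pasted above.", "should_search": False},
]

def grade_search_decision(transcript: list[dict], should_search: bool) -> bool:
    searched = any(m.get("tool") == "web_search" for m in transcript)
    return searched == should_search   # penalize both missed and unnecessary searches
```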

Step 4: Build Robust Infrastructure

The agent under evaluation must behave the same way it does in production. Each trial should start from a clean environment: shared state between runs (leftover files, cached data, resource exhaustion) causes correlated failures that reflect infrastructure flakiness rather than agent performance. In some internal evals, Anthropic observed Claude gaining an unfair advantage by examining git history left over from previous trials.
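
One way to enforce that isolation, sketched under the assumption of a file-based fixture repository, is to copy the fixture into a fresh temporary directory and reset its git history for every trial:

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

def run_isolated_trial(fixture_repo: Path, run_agent) -> dict:
    """Give each trial a pristine workspace so nothing leaks between runs."""
    workdir = Path(tempfile.mkdtemp(prefix="trial-"))
    try:
        shutil.copytree(fixture_repo, workdir / "repo")
        # Re-initialize version control so the agent cannot inspect prior trials' history.
        shutil.rmtree(workdir / "repo" / ".git", ignore_errors=True)
        subprocess.run(["git", "init", "-q"], cwd=workdir / "repo", check=True)
        return run_agent(workdir / "repo")
    finally:
        shutil.rmtree(workdir, ignore_errors=True)   # leave nothing behind for the next trial
```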

Step 5: Design Graders Thoughtfully

Don't check for specific tool call sequences—agents find valid approaches you didn't anticipate. Build in partial credit. For model grading, closely calibrate LLM-as-judge graders with human experts. Give the LLM a way out (return "Unknown" when uncertain). Create clear, structured rubrics for each dimension, using isolated LLM judges per dimension rather than one grader for everything.
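
A minimal sketch of an isolated, per-dimension LLM judge with an explicit "Unknown" escape hatch; `call_model` is a placeholder for whatever model client you use, and the rubric text is illustrative.

```python
JUDGE_PROMPT = """You are grading one dimension of an agent transcript: {dimension}.
Rubric: {rubric}
Answer with exactly one word: PASS, FAIL, or UNKNOWN (if the transcript does not
contain enough information to decide)."""

def call_model(prompt: str) -> str:
    # Placeholder: swap in a real model call and calibrate it against human experts.
    return "UNKNOWN"

def judge_dimension(transcript_text: str, dimension: str, rubric: str) -> str:
    prompt = JUDGE_PROMPT.format(dimension=dimension, rubric=rubric) + "\n\n" + transcript_text
    verdict = call_model(prompt).strip().upper()
    return verdict if verdict in {"PASS", "FAIL", "UNKNOWN"} else "UNKNOWN"

# One isolated judge per dimension, rather than a single grader for everything.
dimensions = {
    "tone": "The agent stays professional and empathetic throughout.",
    "policy": "No refund is promised before the customer's identity is verified.",
}
verdicts = {d: judge_dimension("...transcript text...", d, r) for d, r in dimensions.items()}
```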

Step 6: Check the Transcripts

You won't know if graders are working without reading transcripts and grades from many trials. When a task fails, the transcript tells you whether the agent made a genuine mistake or your graders rejected a valid solution. Failures should seem fair—it should be clear what the agent got wrong and why. Reading transcripts is how you verify your eval measures what actually matters.

Step 7: Monitor for Saturation

An eval at 100% tracks regressions but provides no signal for improvement. As evals approach saturation, only the most difficult tasks remain, making large capability improvements appear as small score increases. With frontier models, a 0% pass rate across many trials (0% pass@100) most often signals a broken task, not an incapable agent.

Step 8: Maintain Long-Term

An eval suite needs ongoing attention and clear ownership. Anthropic found that dedicated eval teams should own core infrastructure while domain experts and product teams contribute most tasks and run evaluations themselves. Practice eval-driven development: build evals to define planned capabilities before agents can fulfill them, then iterate until performance improves.

Layered Evaluation: The Swiss Cheese Model

Automated evaluations represent just one layer of understanding agent performance. No single evaluation method catches every issue. Like the Swiss Cheese Model from safety engineering, failures that slip through one layer should be caught by another.


Each method maps to different development stages:

  • Automated evals: Pre-launch and CI/CD. Run on each agent change and model upgrade as the first line of defense.
  • Production monitoring: Post-launch. Detect distribution drift and unanticipated real-world failures.
  • A/B testing: Validate significant changes once you have sufficient traffic.
  • Transcript review: Ongoing practice. Triage feedback constantly, sample transcripts weekly, dig deeper as needed.
  • Human studies: Reserve for calibrating LLM graders or evaluating subjective outputs where human consensus serves as the reference standard.

Summary

  • Start with 20-50 tasks from real failures
  • Grade outcomes, not paths—agents find solutions you didn't anticipate
  • Combine grader types: deterministic where possible, LLM where needed, human for calibration
  • Build balanced problem sets testing both when behaviors should and shouldn't occur
  • Read the transcripts to verify evals measure what matters
  • Use pass@k when one solution matters, pass^k when consistency matters
  • Define capabilities in evals before agents can fulfill them

Read the full guide →

Includes example YAML configurations and benchmark recommendations

Until next time,

Deep
