Evaluate AI agents on outcomes, not features. Pilot for 30 days on real workflows. Public benchmarks are a floor; pilot evals are the truth.
The 5-step agent evaluation framework
Step 1: Define success criteria (1 day)
Before the pilot, write down:
Outcome metrics — what does success look like?
- Coding agent: rate of PRs merged without revision
- Support agent: tier-1 deflection rate at >X% CSAT
- Sales agent: qualified meetings booked per week
Cost ceiling — what's the max you'd pay per outcome?
Failure modes you can't accept — security breaches, sensitive data leaks, customer-facing errors.
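One way to keep these criteria from drifting during the pilot is to record them as data before it starts. The sketch below is illustrative only: the `SuccessCriteria` class, its field names, and the example values are assumptions, not part of any vendor tooling.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SuccessCriteria:
    """Pilot success criteria, fixed before the pilot starts (hypothetical structure)."""
    outcome_metric: str           # what counts as one successful outcome
    max_cost_per_outcome: float   # cost ceiling per outcome, in your own currency
    unacceptable_failures: frozenset = field(default_factory=frozenset)

    def violates(self, observed_failures):
        """Return the observed failure modes that are on the zero-tolerance list."""
        return self.unacceptable_failures & set(observed_failures)


# Made-up example values for a support agent.
criteria = SuccessCriteria(
    outcome_metric="tier-1 ticket deflected",
    max_cost_per_outcome=2.50,
    unacceptable_failures=frozenset(
        {"security_breach", "data_leak", "customer_facing_error"}
    ),
)
```

Freezing the dataclass is a deliberate choice here: the criteria should not be editable once results start coming in.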
Step 2: Measure baseline (2 weeks)
Run the workflow manually (or with current tools) for 2 weeks. Collect:
- Outcome rate per week
- Time per outcome
- Cost per outcome (fully loaded)
- Quality variance
Without a baseline, you can't measure improvement.
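The baseline numbers above can be derived from a simple weekly log. This is a minimal sketch under stated assumptions: the record fields (`attempts`, `successes`, `hours`, `cost`) and the sample figures are made up, and the spread of the weekly outcome rate stands in for "quality variance", which your team may define differently.

```python
from statistics import pstdev

# Hypothetical two-week baseline log, one record per week.
baseline_weeks = [
    {"attempts": 120, "successes": 90, "hours": 60.0, "cost": 3600.0},
    {"attempts": 100, "successes": 80, "hours": 50.0, "cost": 3000.0},
]


def summarize_baseline(weeks):
    """Aggregate weekly records into the four baseline metrics."""
    successes = sum(w["successes"] for w in weeks)
    attempts = sum(w["attempts"] for w in weeks)
    weekly_rates = [w["successes"] / w["attempts"] for w in weeks]
    return {
        "outcome_rate": successes / attempts,
        "outcomes_per_week": successes / len(weeks),
        "hours_per_outcome": sum(w["hours"] for w in weeks) / successes,
        # "cost" should already be fully loaded (salary, tools, overhead).
        "cost_per_outcome": sum(w["cost"] for w in weeks) / successes,
        # Assumption: spread of the weekly rate as a proxy for quality variance.
        "weekly_rate_stdev": pstdev(weekly_rates),
    }


baseline = summarize_baseline(baseline_weeks)
```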
Step 3: Run public benchmark check (1 hour)
For coding agents → SWE-bench Verified. Anything under 40% is suspect.
For general agents → GAIA benchmark. Anything under 30% means weak general capability.
For support agents → published deflection-rate case studies from the vendor.
Public benchmarks set a floor: they screen out weak agents, but they don't predict fit for your workflow.
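The screening thresholds above can be written down as a small lookup. This is only a sketch: the `screen` function and the agent-type keys are hypothetical, and the scores you feed it are whatever the vendor or a public leaderboard reports.

```python
# Floors taken from the thresholds above; scores are fractions in [0, 1].
BENCHMARK_FLOORS = {
    "coding": ("SWE-bench Verified", 0.40),
    "general": ("GAIA", 0.30),
}


def screen(agent_type, reported_score):
    """Return a one-line screening verdict for a reported benchmark score."""
    if agent_type not in BENCHMARK_FLOORS:
        return "no public benchmark floor; review vendor case studies"
    name, floor = BENCHMARK_FLOORS[agent_type]
    if reported_score < floor:
        return f"suspect: {name} score {reported_score:.0%} is under {floor:.0%}"
    return f"passes screen: {name} score {reported_score:.0%}; still needs a pilot"
```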
Step 4: Run the pilot (30 days)
Real workflows, real users (if customer-facing), real failure conditions.
Measure daily:
- Outcome rate
- Cost per outcome
- Failures by category (prompt issue, tool error, hallucination, user confusion)
Compare to baseline:
- Better, worse, or same on each metric?
- Better in some dimensions but worse in others?
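The daily measurements and the baseline comparison can be sketched as two small functions. Everything here is an assumption for illustration: the run-record fields, the category labels as identifiers, and the 2% tolerance used to call two values "same".

```python
from collections import Counter

FAILURE_CATEGORIES = {"prompt_issue", "tool_error", "hallucination", "user_confusion"}

# For each metric: True if a higher value is better, False if lower is better.
HIGHER_IS_BETTER = {"outcome_rate": True, "cost_per_outcome": False}


def summarize_day(runs):
    """runs: list of dicts with 'success' (bool), 'cost' (float), 'failure' (str or None)."""
    successes = sum(1 for r in runs if r["success"])
    total_cost = sum(r["cost"] for r in runs)
    failures = Counter(r["failure"] for r in runs if not r["success"])
    unknown = set(failures) - FAILURE_CATEGORIES
    if unknown:
        raise ValueError(f"uncategorized failures: {unknown}")
    return {
        "outcome_rate": successes / len(runs) if runs else 0.0,
        "cost_per_outcome": total_cost / successes if successes else float("inf"),
        "failures": dict(failures),
    }


def compare_to_baseline(pilot, baseline, tolerance=0.02):
    """Label each metric 'better', 'worse', or 'same' relative to the baseline."""
    verdicts = {}
    for metric, higher_is_better in HIGHER_IS_BETTER.items():
        delta = (pilot[metric] - baseline[metric]) / baseline[metric]
        if abs(delta) <= tolerance:
            verdicts[metric] = "same"
        elif (delta > 0) == higher_is_better:
            verdicts[metric] = "better"
        else:
            verdicts[metric] = "worse"
    return verdicts
```

Forcing every failure into a known category (and raising on anything else) keeps the daily log honest: an uncategorized failure is a sign the taxonomy needs another bucket.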
Step 5: Make the decision (1 day)
Buy if:
- Outcome rate ≥ 80% of baseline
- Cost per outcome ≤ baseline
- None of the unacceptable failure modes from Step 1 triggered
- Users prefer the agent-driven workflow
Don't buy if any of:
- Outcome rate < 60% of baseline (too risky)
- Cost per outcome > baseline (negative ROI)
- An unacceptable failure mode triggered even once
- Users prefer the old workflow
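The two checklists can be expressed as a single decision function. This is a sketch, not a definitive rule: the function name and arguments are hypothetical, and the "inconclusive" result is an assumption added here to cover cases the lists leave open, such as an outcome rate between 60% and 80% of baseline or users with no clear preference.

```python
def decide(outcome_ratio, cost_ratio, critical_failures, user_preference):
    """
    outcome_ratio: pilot outcome rate / baseline outcome rate
    cost_ratio: pilot cost per outcome / baseline cost per outcome
    critical_failures: count of unacceptable failure modes triggered
    user_preference: "agent", "old", or "neutral"
    Returns (verdict, reasons).
    """
    blockers = []
    if outcome_ratio < 0.60:
        blockers.append("outcome rate below 60% of baseline")
    if cost_ratio > 1.0:
        blockers.append("cost per outcome above baseline")
    if critical_failures > 0:
        blockers.append("unacceptable failure mode triggered")
    if user_preference == "old":
        blockers.append("users prefer the old workflow")
    if blockers:
        return "don't buy", blockers
    if outcome_ratio >= 0.80 and user_preference == "agent":
        return "buy", []
    # Assumption: anything that meets neither list is treated as inconclusive.
    return "inconclusive", ["result falls between the buy and don't-buy criteria"]
```

For example, `decide(0.9, 0.8, 0, "agent")` returns a "buy" verdict, while `decide(0.7, 0.8, 0, "agent")` comes back "inconclusive" because the outcome rate sits between the two thresholds.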
The eval suite for production agents
If you're building your own agent (not buying), you need an eval suite. See our deep dive on AI evals.
Minimum viable eval suite:
- 50 test cases across top intents
- Each with input + expected output (or grading rubric)
- Run on every model swap or prompt change
- Tracked over time in an observability tool
Tools: Braintrust, Promptfoo, LangSmith.
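Before adopting one of these tools, the shape of a minimum viable suite can be sketched in plain Python. The code below does not use any of those tools' APIs; `EvalCase`, `run_suite`, and the stub agent are hypothetical names, and a real suite would hold around 50 cases rather than three.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class EvalCase:
    intent: str
    prompt: str
    expected: Optional[str] = None                  # exact-match target
    grader: Optional[Callable[[str], bool]] = None  # rubric expressed as a function

    def passed(self, output: str) -> bool:
        """Use the rubric if one is given, otherwise fall back to exact match."""
        if self.grader is not None:
            return self.grader(output)
        return output.strip() == (self.expected or "").strip()


def run_suite(agent, cases):
    """Run every case through the agent; return overall and per-intent pass rates."""
    per_intent = {}
    for case in cases:
        ok = case.passed(agent(case.prompt))
        passed, total = per_intent.get(case.intent, (0, 0))
        per_intent[case.intent] = (passed + int(ok), total + 1)
    total_passed = sum(p for p, _ in per_intent.values())
    total = sum(t for _, t in per_intent.values())
    return {
        "pass_rate": total_passed / total if total else 0.0,
        "by_intent": {k: p / t for k, (p, t) in per_intent.items()},
    }


def stub_agent(prompt):
    """Stand-in for the real agent call (an assumption for this sketch)."""
    if "refund" in prompt.lower():
        return "Refunds take 5 business days."
    return "I don't know."


cases = [
    EvalCase("refund", "How long do refunds take?", expected="Refunds take 5 business days."),
    EvalCase("refund", "Refund timeline?", grader=lambda out: "5 business days" in out),
    EvalCase("shipping", "Where is my order?", grader=lambda out: "tracking" in out.lower()),
]
report = run_suite(stub_agent, cases)
```

Rerunning `run_suite` on every model swap or prompt change, and storing each `report` with a timestamp, gives the over-time tracking described in the list above.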
Common evaluation mistakes
1. Measuring process, not outcome. "Response time improved" is a process metric. Did the customer's problem actually get solved?
2. Trusting demos. Vendor demos are curated. Always pilot.
3. Single-week evaluations. Agents drift. You need at least 4 weeks to see real behavior.
4. No baseline. Without a baseline, you can't tell if "75% success" is good or bad.
5. Ignoring edge cases. Edge cases are what break agents in production. Design tests specifically for them.
The verdict
Evaluation = baseline + benchmark check + 30-day pilot + outcome math. Skip any and you're buying based on marketing.
For more see How to deploy an AI agent, How to build an AI agent, and The 15 best AI agents in 2026.