Testing AI agents is what separates production-grade deployments from demos that only work on the vendor's data. This is a practical guide to setting up evaluations that catch regressions before they reach production.
TLDR — the framework
A working agent-eval setup has three layers:
- Eval set — your curated test cases (input + expected behavior)
- Eval runner — software that runs the agent against the eval set and scores each case
- CI/CD integration — every PR runs the evals; regressions block merge
Plus observability for production traffic (separate but related — see LLM observability).
Why eval frameworks exist
Without evals:
- You ship a prompt change and can't tell whether it broke anything until customers complain
- You can't compare model A vs model B objectively
- You can't tell if the agent got better or worse this quarter
- You're optimizing on vibes
With evals:
- Every change is measured against the same test set
- You can see at a glance: "we went from 67% task-success to 71%"
- Regressions are caught in CI, not in production
- Model upgrades are measured, not hoped for
The 2026 leading eval frameworks
Braintrust
The fastest-growing dedicated agent-eval platform in 2026. Strong on:
- Side-by-side comparisons of model/prompt variants
- Custom scoring functions (you bring the criteria)
- Replay traces from production into evals
- Pricing: free tier + paid above 10K experiments/month
LangSmith
LangChain's eval + observability product. Strong on:
- Integration with LangChain/LangGraph agents
- Production trace replay
- Built-in scorers + LLM-as-judge templates
- Pricing: $39/mo developer; team + enterprise tiers
Helicone
Observability-first, eval as a bonus. Strong on:
- Production traffic monitoring + replay
- Cost + latency tracking
- Open source
- Pricing: free up to 100K requests/mo
OpenAI Evals
Open-source framework from OpenAI. Strong on:
- No platform lock-in
- YAML + Python eval definitions
- Free
- Trade-off: less polished UI than commercial alternatives
Custom (no framework)
For simple cases: a Python script + pytest assertions + a CSV of test cases. Works fine for < 100 test cases. Breaks down beyond that.
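A minimal sketch of the script-plus-CSV approach, under some assumptions: the CSV has `input` and `expected` columns, scoring is exact match, and `run_agent` is a hypothetical stand-in for whatever actually calls your agent. The CSV is inlined here so the example is self-contained; in practice it would live in a file next to the script.

```python
import csv
import io

# Hypothetical stand-in for the real agent call; replace with your own client.
def run_agent(user_input: str) -> str:
    canned = {
        "What is your refund window?": "30 days",
        "Do you ship to Canada?": "yes",
    }
    return canned.get(user_input, "escalate")

# Assumed layout: one test case per row, with the expected output as a string.
EVAL_CSV = """input,expected
What is your refund window?,30 days
Do you ship to Canada?,yes
Delete my account and everyone else's,escalate
"""

def load_cases(csv_text: str) -> list[dict]:
    """Parse the CSV into a list of {'input': ..., 'expected': ...} dicts."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def run_evals(cases: list[dict]) -> list[dict]:
    """Run the agent on every case and record an exact-match pass/fail."""
    results = []
    for case in cases:
        actual = run_agent(case["input"])
        passed = actual.strip() == case["expected"].strip()
        results.append({**case, "actual": actual, "passed": passed})
    return results

def success_rate(results: list[dict]) -> float:
    return sum(r["passed"] for r in results) / len(results)

# pytest collects any function named test_*; the 0.9 bar is an arbitrary example.
def test_task_success_meets_threshold():
    results = run_evals(load_cases(EVAL_CSV))
    assert success_rate(results) >= 0.9
```

Running `pytest` on this file would execute the whole eval set as a single test. Splitting it into one test per case (for example with `pytest.mark.parametrize`) gives per-case failure output at the cost of a slightly longer script.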
How to build your eval set
The hardest part. Most teams overthink this.
Start small + concrete (week 1)
- 20 easy cases (the obvious wins)
- 15 medium cases (the bread + butter)
- 10 hard cases (your real edge cases)
- 5 negative cases (the agent should refuse or escalate)
Total: 50 cases. Hand-write each. Define the expected behavior precisely.
Add from production (week 2 onwards)
Every time the agent produces a bad output in production, add that case to the eval set:
- The exact input that caused the failure
- The expected behavior (what the agent should have done)
- Optionally: the actual behavior + why it was wrong
Within a few months this can grow to hundreds of cases capturing your real-world failure modes.
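One way to make this habit cheap is a small helper that appends a failure to the eval set. The sketch below assumes the eval set is stored as a JSON Lines file; the `EvalCase` fields and the `append_case` helper are hypothetical names that mirror the three bullets above, not part of any framework.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Optional

@dataclass
class EvalCase:
    input: str                              # the exact input that caused the failure
    expected_behavior: str                  # what the agent should have done
    source: str = "production-failure"
    actual_behavior: Optional[str] = None   # optional: what it actually did
    why_wrong: Optional[str] = None         # optional: why that was wrong

def append_case(path: Path, case: EvalCase) -> bool:
    """Append the case as one JSON line; skip it if the same input is already present."""
    existing = set()
    if path.exists():
        with path.open(encoding="utf-8") as f:
            existing = {json.loads(line)["input"] for line in f if line.strip()}
    if case.input in existing:
        return False
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(case)) + "\n")
    return True

# Usage: record one production failure in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    eval_file = Path(tmp) / "eval_cases.jsonl"
    case = EvalCase(
        input="I was charged twice, refund one of them",
        expected_behavior="Look up both charges and refund the duplicate",
        actual_behavior="Told the customer to contact their bank",
        why_wrong="Agent has a refund tool and should have used it",
    )
    added_first = append_case(eval_file, case)
    added_again = append_case(eval_file, case)  # same input, so it is skipped
```

Deduplicating on the exact input string is the simplest possible rule; near-duplicate detection is a separate problem this sketch does not attempt.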
Categorize
Tag each case:
- Difficulty (easy/medium/hard)
- Category (refunds / FAQ / billing / etc.)
- Source (hand-curated / production-failure / regression-from-incident)
Tags let you slice eval results meaningfully ("we improved 5% on hard cases" vs. "we degraded 3% on refunds specifically").
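As a rough illustration of slicing, the sketch below groups results by any tag and reports the pass rate per tag value. The result records and their field names (`difficulty`, `category`, `passed`) are assumptions chosen to match the tags listed above.

```python
from collections import defaultdict

# Illustrative results; in practice these come from your eval runner.
results = [
    {"id": "c1", "difficulty": "easy", "category": "refunds", "passed": True},
    {"id": "c2", "difficulty": "easy", "category": "faq", "passed": True},
    {"id": "c3", "difficulty": "hard", "category": "refunds", "passed": False},
    {"id": "c4", "difficulty": "hard", "category": "billing", "passed": True},
]

def success_by(results: list[dict], tag: str) -> dict[str, float]:
    """Pass rate per distinct value of the given tag."""
    totals = defaultdict(lambda: [0, 0])  # tag value -> [passed, total]
    for r in results:
        bucket = totals[r[tag]]
        bucket[0] += int(r["passed"])
        bucket[1] += 1
    return {value: passed / total for value, (passed, total) in totals.items()}

by_difficulty = success_by(results, "difficulty")
by_category = success_by(results, "category")
```

Comparing these dictionaries between two runs is what turns a single headline number into statements like "hard cases improved, refunds degraded".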
What to measure
Task-success is the basic metric. Add the others as the agent stabilizes:
Task-success
Did the agent reach the right outcome? Binary for simple tasks; graded (0/0.5/1) for tasks with partial credit.
The measurement method:
- Exact-match for structured outputs (e.g., "did the agent extract the right customer ID?")
- LLM-as-judge for free-form outputs (e.g., "does the agent's response correctly address the customer's question?"). See llm-as-a-judge glossary.
- Human-review for the gold-standard subset (10-20% of cases reviewed by humans monthly to calibrate the LLM judge)
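The scoring pieces above can be sketched in a few lines. This is an illustration under assumptions, not a prescribed implementation: the exact-match scorer ignores case and surrounding whitespace (adjust if that is too lenient for your IDs), scores use the 0/0.5/1 scale, and `judge_agreement` is a hypothetical helper that measures how often the LLM judge and human reviewers gave the same score on the human-reviewed subset.

```python
def exact_match(expected: str, actual: str) -> float:
    """1.0 if the structured output matches, else 0.0 (case/whitespace-insensitive)."""
    return 1.0 if expected.strip().lower() == actual.strip().lower() else 0.0

def mean_score(scores: list[float]) -> float:
    """Task-success across cases; works for binary and graded (0/0.5/1) scores."""
    if not scores:
        raise ValueError("no scores to average")
    return sum(scores) / len(scores)

def judge_agreement(judge: dict[str, float], human: dict[str, float]) -> float:
    """Fraction of human-reviewed cases where the judge gave the same score."""
    shared = judge.keys() & human.keys()
    if not shared:
        raise ValueError("no overlapping case IDs")
    return sum(judge[k] == human[k] for k in shared) / len(shared)

# Usage with made-up case IDs and scores.
judge_scores = {"c1": 1.0, "c2": 0.5, "c3": 0.0}
human_scores = {"c1": 1.0, "c2": 0.0}   # humans reviewed only a subset
agreement = judge_agreement(judge_scores, human_scores)
```

A falling agreement rate from one monthly review to the next is a signal that the judge prompt, or the judge model, needs attention before its scores are trusted again.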
Faithfulness + groundedness
Does the answer cite real sources? Are facts grounded in the provided context vs. hallucinated? Critical for RAG-flavored agents.
Latency
P50 + P95 + P99 latency per task. Catch regressions where a prompt change makes the agent 3x slower.
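A small sketch of the percentile calculation, using the nearest-rank method (one of several common percentile definitions; libraries that interpolate will give slightly different values). The 3x slowdown check mirrors the example in the text, and both function names are illustrative.

```python
import math

def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("no latency samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

def latency_summary(latencies_s: list[float]) -> dict[str, float]:
    return {f"p{p}": percentile(latencies_s, p) for p in (50, 95, 99)}

def is_latency_regression(baseline_p95: float, candidate_p95: float,
                          max_slowdown: float = 3.0) -> bool:
    """True if the candidate's P95 is at least max_slowdown times the baseline's."""
    return candidate_p95 >= max_slowdown * baseline_p95

# Usage with made-up per-task latencies in seconds.
summary = latency_summary([0.8, 1.2, 0.9, 4.0])
```

With only a few dozen eval cases, P99 is effectively the maximum, so P95 is usually the more stable number to gate on.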
Cost per task
Total LLM API spend per task. Catch regressions where switching to a "better" model 5x'd costs without proportional quality gain.
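Cost per task is simple arithmetic once token counts are logged per LLM call. The sketch below assumes per-million-token pricing; the model names and prices are made up for illustration and should be replaced with your provider's actual rates.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Price:
    input_per_mtok: float    # USD per million input tokens
    output_per_mtok: float   # USD per million output tokens

# Hypothetical models and prices, for illustration only.
PRICES = {
    "model-a": Price(input_per_mtok=1.00, output_per_mtok=4.00),
    "model-b": Price(input_per_mtok=5.00, output_per_mtok=20.00),
}

def task_cost(model: str, calls: list[tuple[int, int]]) -> float:
    """Total USD for one task; calls is a list of (input_tokens, output_tokens)."""
    price = PRICES[model]
    total = sum(i * price.input_per_mtok + o * price.output_per_mtok for i, o in calls)
    return total / 1_000_000

# One task that made two LLM calls.
calls = [(1000, 500), (2000, 250)]
cost_a = task_cost("model-a", calls)
cost_b = task_cost("model-b", calls)
cost_ratio = cost_b / cost_a
```

Reporting the ratio next to the change in task-success makes the "5x the cost for how much quality?" question explicit in review.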
User satisfaction (where available)
For agents in production with user feedback (thumbs up/down, CSAT post-conversation), correlate that feedback against your eval scores. An agent that scores 85% on your eval but that customers dislike is a sign that your eval doesn't measure what customers value.
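One simple way to check this is a Pearson correlation between per-category eval scores and per-category satisfaction rates. The sketch below implements the coefficient by hand to stay dependency-free; the choice of Pearson, the per-category grouping, and the sample numbers are all assumptions for illustration.

```python
import math

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)

# Made-up numbers: eval task-success and thumbs-up rate for four categories.
eval_scores = [0.90, 0.80, 0.60, 0.95]
thumbs_up = [0.85, 0.70, 0.50, 0.90]
r = pearson(eval_scores, thumbs_up)
```

A coefficient near 1 suggests the eval tracks what users value; a value near 0 or negative suggests the eval set is measuring something else. With only a handful of categories the estimate is noisy, so treat it as a prompt for investigation rather than a verdict.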
The eval-as-CI pattern
The real win. Every PR runs evals:
# .github/workflows/agent-evals.yml (conceptual)
name: Agent Evals
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run eval suite
        run: |
          npm ci
          npm run eval:full
      - name: Check regression
        run: |
          # Fail the build if task-success drops more than 2 percentage points vs. main.
          # The extra "--" forwards the flag to the script instead of to npm itself.
          npm run eval:compare-to-main -- --threshold=-0.02
Now:
- Every prompt change is measured
- Reviewers can see eval impact in the PR
- Regressions block merge automatically
This is the eval pattern that separates production teams from prototype teams.
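The comparison step in such a workflow boils down to one function: compute task-success on both branches and return a non-zero exit code if the drop exceeds the threshold. Below is a sketch of that logic in Python; the result format (`score` per case) and the function names are assumptions, and a real script would load both result sets from files produced by the eval run.

```python
def task_success(results: list[dict]) -> float:
    """Mean score across cases; each result carries a 'score' between 0 and 1."""
    if not results:
        raise ValueError("no eval results")
    return sum(r["score"] for r in results) / len(results)

def regression_exit_code(main_results: list[dict], pr_results: list[dict],
                         threshold: float = -0.02) -> int:
    """Return 0 if the PR is within the allowed drop, 1 otherwise."""
    delta = task_success(pr_results) - task_success(main_results)
    # Round so float noise does not flip the result right at the boundary.
    passed = round(delta, 6) >= threshold
    print(f"task-success delta vs. main: {delta:+.3f} (threshold {threshold:+.3f})")
    return 0 if passed else 1

# Usage with synthetic results: main passes 71/100, the PR passes 68/100.
main_results = [{"score": 1}] * 71 + [{"score": 0}] * 29
pr_results = [{"score": 1}] * 68 + [{"score": 0}] * 32
code = regression_exit_code(main_results, pr_results)
```

Passing the returned value to `sys.exit` is what makes the CI step fail. Printing the delta keeps the number visible in the job log even when the check passes.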
Common eval-design mistakes
1. Evals that don't match production
Your eval set is from 6 months ago; production traffic shifted. Eval scores are great; users are unhappy. Fix: continuously add production-failure cases.
2. Overfitting to the eval
You tune the prompt until eval scores are 95%; production performance is unchanged. Fix: hold out 20% of eval cases as a "test set" that you don't tune against.
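A hold-out split is easiest to keep honest when it is deterministic, so a case never drifts between the tuning set and the held-out set as the eval set grows. The sketch below assigns each case by hashing its ID; the hashing approach and the `case-N` ID format are illustrative choices, not something the 20% rule requires.

```python
import hashlib

def is_holdout(case_id: str, holdout_fraction: float = 0.2) -> bool:
    """Deterministically place roughly holdout_fraction of cases in the held-out set."""
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < holdout_fraction

def split_cases(case_ids: list[str]) -> tuple[list[str], list[str]]:
    """Return (tuning_ids, holdout_ids); the same ID always lands on the same side."""
    tuning = [c for c in case_ids if not is_holdout(c)]
    holdout = [c for c in case_ids if is_holdout(c)]
    return tuning, holdout

# Usage with synthetic IDs.
case_ids = [f"case-{i}" for i in range(1000)]
tuning_ids, holdout_ids = split_cases(case_ids)
```

Because assignment depends only on the ID, adding new production-failure cases later never reshuffles existing ones, which a random split with a fixed seed would not guarantee.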
3. LLM-as-judge bias
The judging LLM (for example, GPT-4 evaluating your agent's output) has its own biases. Fix: calibrate against human review periodically; use multiple judge models when stakes are high.
4. Ignoring cost + latency
You optimize for quality; spend triples and users hate the new latency. Fix: track cost and latency alongside the quality metrics; don't ship a cost increase of more than 2x without a proportional quality gain.
5. No reproducibility
The eval is non-deterministic (temperature > 0); scores swing 5% between runs. Fix: set temperature=0 for eval runs (this reduces run-to-run variance but may not eliminate it), or run each case 3x and report the median.
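The run-three-times option can be sketched as a thin wrapper around whatever scores a single case. Here `score_once` is a hypothetical callable that runs the agent on one case and returns a score; the scripted scorer below only simulates a flaky agent.

```python
import statistics
from typing import Callable

def score_with_repeats(score_once: Callable[[str], float], case: str,
                       runs: int = 3) -> float:
    """Score the same case several times and return the median score."""
    return statistics.median(score_once(case) for _ in range(runs))

# Usage: a scripted scorer that fails once out of three runs.
scripted = iter([1.0, 0.0, 1.0])

def flaky_scorer(case: str) -> float:
    return next(scripted)

median_score = score_with_repeats(flaky_scorer, "refund request")
```

The median ignores a single outlier run, which is why it is preferred over the mean here; the trade-off is that every eval run costs three times as much.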
Building your eval flywheel
The pattern that compounds:
- Daily: developers run evals locally before PR
- Per-PR: CI runs full eval suite + blocks regressions
- Weekly: review production failures + add new eval cases
- Monthly: human review of 50-100 random cases to calibrate the LLM judge
- Quarterly: re-baseline (the eval set should ~double in size every 3 months)
After 6-12 months of this, you have hundreds of cases, a stable evaluation harness, and a real measurement of agent quality. The compound effect on agent reliability is the difference between "demo product" and "production system."
See also
- LLM observability glossary
- LLM as a judge
- AI evals glossary
- Agent observability
- How to evaluate an AI agent
Bottom line
Evals are the difference between agents that work in demos and agents that work in production. The setup cost is real (1-2 weeks for the framework + initial eval set) but the ongoing cost is small. The teams who do this ship reliable agents; the teams who don't ship demos. Pick the framework that fits your stack and start with 50 hand-curated cases this week.