Eval
A systematic test that measures agent performance on a fixed set of inputs — the agent equivalent of a test suite.
You can't improve what you don't measure. Production agent stacks in 2026 ship with eval suites: hundreds to thousands of input/expected-output pairs, run on every model change to catch regressions.
Good evals are *unit tests with judgment*. They have specific inputs, specific success criteria (usually graded by a stronger model), and run automatically in CI.
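As a rough illustration of "unit tests with judgment", here is a minimal sketch of an eval harness. Every name in it (`EvalCase`, `run_eval`, `regressions`, the two toy agents) is hypothetical and not taken from any particular framework. The grader is a plain exact-match stand-in; in a real suite that slot would usually hold a call to a stronger model.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class EvalCase:
    """One fixed input paired with its success criterion (hypothetical shape)."""
    case_id: str
    prompt: str
    expected: str


# A grader decides whether an output satisfies the case's success criterion.
Grader = Callable[[EvalCase, str], bool]


def exact_match_grader(case: EvalCase, output: str) -> bool:
    """Stand-in grader: case-insensitive exact match.

    Assumption: a production suite would swap this for a model-based judge.
    """
    return output.strip().lower() == case.expected.strip().lower()


def run_eval(
    agent: Callable[[str], str],
    cases: Iterable[EvalCase],
    grader: Grader,
) -> dict[str, bool]:
    """Run every case through the agent and record pass/fail per case id."""
    return {case.case_id: grader(case, agent(case.prompt)) for case in cases}


def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of cases that passed; 0.0 for an empty suite."""
    return sum(results.values()) / len(results) if results else 0.0


def regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Case ids that passed on the baseline but do not pass on the candidate."""
    return sorted(
        cid for cid, ok in baseline.items() if ok and not candidate.get(cid, False)
    )


# Toy data: two cases and two fake "agents" standing in for model versions.
CASES = [
    EvalCase("refund-1", "What is the refund window?", "30 days"),
    EvalCase("hours-1", "When does support open?", "9am"),
]


def baseline_agent(prompt: str) -> str:
    answers = {"What is the refund window?": "30 days", "When does support open?": "9am"}
    return answers.get(prompt, "")


def candidate_agent(prompt: str) -> str:
    answers = {"What is the refund window?": "30 days", "When does support open?": "10am"}
    return answers.get(prompt, "")


baseline_results = run_eval(baseline_agent, CASES, exact_match_grader)
candidate_results = run_eval(candidate_agent, CASES, exact_match_grader)
```

In CI, a job built along these lines could fail the build whenever `regressions(baseline_results, candidate_results)` is non-empty or `pass_rate(candidate_results)` falls below a threshold the team has chosen.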
The discipline is new enough that "we have evals" is still a real differentiator between mature teams and the rest.
Frequently asked
How many eval cases do I need?
For a focused agent, 100–500 cases covering the top intents is usually enough to catch the obvious regressions. Past a thousand cases the marginal value drops sharply.
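One way to check whether a suite actually covers the top intents is to count cases per intent and flag the thin ones. The sketch below assumes each case carries an intent label; the function name `coverage_gaps` and the `min_cases` default of 5 are illustrative choices, not recommendations from this entry.

```python
from collections import Counter


def coverage_gaps(case_intents, top_intents, min_cases=5):
    """Return {intent: case_count} for top intents with fewer than min_cases cases.

    min_cases is a hypothetical threshold; tune it to the agent in question.
    """
    counts = Counter(case_intents)  # missing intents count as 0
    return {
        intent: counts[intent]
        for intent in top_intents
        if counts[intent] < min_cases
    }


# Toy suite: six refund cases, two billing cases, no cancellation cases.
suite_intents = ["refund"] * 6 + ["billing"] * 2
gaps = coverage_gaps(suite_intents, ["refund", "billing", "cancel"])
```

Here `gaps` lists the intents that still need cases, which tends to be a more useful guide than the raw size of the suite.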