📊 Evaluation · also: agent bench, agent-bench benchmark

AgentBench

A multi-environment benchmark suite for LLM-as-agent performance. Its 8 environments cover operating system, database, knowledge graph, card game, lateral-thinking, house-holding, web shopping, and web browsing tasks.

AgentBench, introduced in 2023 by researchers from Tsinghua University, Ohio State University, and UC Berkeley, scores LLMs on their ability to act as agents in interactive environments rather than answer trivia. The 8 environments are operating system shell, SQL database, knowledge graph, digital card game, lateral-thinking puzzles, house-holding, web shopping, and web browsing.

It is one of the earlier "agent-native" benchmarks. Tasks are multi-turn, so the scoring penalizes hallucination and rewards multi-step planning, tool use, and recovery from errors. That makes it closer to real production conditions than MMLU-style trivia.
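To make "multi-step planning, tool use, and recovery from errors" concrete, here is a minimal sketch of a multi-turn agent episode. Everything in it is hypothetical and written for illustration: ToyShellEnv, scripted_agent, and run_episode are not part of AgentBench's actual harness, and a real run would call an LLM where the scripted policy sits.

```python
from dataclasses import dataclass, field


@dataclass
class ToyShellEnv:
    """Hypothetical stand-in for an agent environment; not AgentBench's API."""
    files: dict = field(default_factory=lambda: {"notes.txt": "hello"})
    done: bool = False
    success: bool = False

    def step(self, action: str) -> str:
        # Every action returns an observation, including error feedback.
        parts = action.split()
        if parts[:1] == ["ls"]:
            return " ".join(sorted(self.files))
        if parts[:1] == ["cat"] and len(parts) == 2:
            if parts[1] in self.files:
                return self.files[parts[1]]
            return f"error: no such file {parts[1]}"
        if parts[:1] == ["answer"] and len(parts) == 2:
            # Submitting an answer ends the episode; it is graded once.
            self.done = True
            self.success = parts[1] == self.files["notes.txt"]
            return "submitted"
        return "error: unknown command"


def scripted_agent(history):
    """Hypothetical policy: guesses a file name, then recovers from the error."""
    if not history:
        return "cat readme.txt"      # wrong first guess
    last_obs = history[-1][1]
    if last_obs.startswith("error"):
        return "ls"                  # recover by inspecting the environment
    if last_obs == "notes.txt":
        return "cat notes.txt"
    return f"answer {last_obs}"


def run_episode(env, agent, max_turns=8):
    """Alternate agent actions and environment observations until done."""
    history = []
    for _ in range(max_turns):
        action = agent(history)
        obs = env.step(action)
        history.append((action, obs))
        if env.done:
            break
    return env.success, history


ok, history = run_episode(ToyShellEnv(), scripted_agent)
for action, obs in history:
    print(f"{action!r} -> {obs!r}")
print("success:", ok)
```

The episode is graded only on its final outcome, so the agent's wrong first guess costs a turn but not the task, provided it reads the error and adjusts.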

In 2026, AgentBench is one of three benchmarks (alongside GAIA and SWE-bench) that buyers cite when comparing agent platforms. Reasoning-tier models from OpenAI, Anthropic, and Google lead the leaderboard.

Frequently asked

How is AgentBench different from MMLU?

MMLU measures knowledge — multiple-choice questions. AgentBench measures action — the model has to actually do things across 8 environments. Most strong MMLU models still struggle on AgentBench tasks.

Why are AgentBench scores so much lower than chat-LLM benchmarks?

Agent tasks are end-to-end. A single missed tool call or wrong-format output fails the whole task. Even GPT-class models score 30–60% on AgentBench environments versus 85%+ on chat benchmarks.
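A rough back-of-the-envelope sketch shows why end-to-end grading pulls scores down. It assumes every step succeeds independently with the same probability and that the agent never recovers from a mistake. Both are simplifications, and the 95% per-step figure is an illustrative assumption, not an AgentBench measurement.

```python
def task_success_rate(step_accuracy: float, steps: int) -> float:
    """Chance that all steps succeed, assuming independent steps and no recovery."""
    return step_accuracy ** steps


# Illustrative per-step accuracy of 95% (an assumption, not a benchmark result).
for steps in (1, 5, 10, 20):
    print(steps, round(task_success_rate(0.95, steps), 3))
# prints:
# 1 0.95
# 5 0.774
# 10 0.599
# 20 0.358
```

Under these assumptions, a model that is right 95% of the time per step completes only about 60% of 10-step tasks. Real agents can sometimes recover mid-task, so this is an intuition for the gap, not a prediction.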

Related terms