📊 Evaluation (also: swe-bench, swe bench, swe-bench verified)

SWE-bench

A benchmark from Princeton that tests coding agents on real GitHub issues: given the bug report and the repo, the agent must produce a patch that passes the project's tests.

SWE-bench is the most relevant coding-agent benchmark in 2026. Unlike HumanEval, which tests single-function generation, SWE-bench measures the full end-to-end task: read a GitHub issue, navigate the repo, write a patch, and pass the project's tests for that issue without breaking tests that already passed. It mirrors what coding agents actually do in production.
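The pass/fail rule for a single task can be sketched in a few lines. This is a minimal illustration, not the official harness: the `TaskInstance` class, the `is_resolved` function, and the example test names are hypothetical. The two test lists follow what I understand to be the dataset's split between tests the fix must turn green and tests that must stay green.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TaskInstance:
    """One benchmark task (hypothetical, simplified shape)."""
    instance_id: str
    fail_to_pass: list[str]  # tests that fail before the fix and must pass after it
    pass_to_pass: list[str] = field(default_factory=list)  # tests that must keep passing


def is_resolved(task: TaskInstance, passed_tests: set[str]) -> bool:
    """Return True only if every required test passed after applying the agent's patch."""
    required = task.fail_to_pass + task.pass_to_pass
    return all(test in passed_tests for test in required)


task = TaskInstance(
    instance_id="example__repo-1",  # made-up identifier
    fail_to_pass=["tests/test_parser.py::test_empty_input"],
    pass_to_pass=["tests/test_parser.py::test_basic"],
)

# The patch fixes the bug but breaks an existing test: not resolved.
broke_existing = is_resolved(task, {"tests/test_parser.py::test_empty_input"})

# The patch fixes the bug and keeps existing behaviour: resolved.
clean_fix = is_resolved(
    task,
    {"tests/test_parser.py::test_empty_input", "tests/test_parser.py::test_basic"},
)

print(broke_existing, clean_fix)
```

The all-or-nothing check is the point: a patch that fixes the reported bug but regresses something else scores zero for that issue.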

The two variants matter: SWE-bench (2,294 issues, mixed quality) and SWE-bench Verified (500 hand-screened issues with reliable tests). Always quote Verified scores; scores on the full set include many issues with broken or under-specified tests.

Leaderboard in early 2026: Devin, Claude with extended thinking, and Manus all sit at 60–75% on SWE-bench Verified. Cursor, Cline, and Codex CLI sit at 45–60%. The numbers improve every quarter; expect 80%+ frontier scores by end of 2026.
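When comparing scores like these, it helps to remember that a 500-issue set carries sampling noise. The sketch below puts a rough 95% interval around a resolve rate using the Wilson score interval. That method is my choice for illustration, not something the leaderboard publishes, and it treats each issue as an independent pass/fail trial, which is a simplification. The function name `wilson_interval` is hypothetical.

```python
from __future__ import annotations

import math


def wilson_interval(resolved: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a resolve rate (z = 1.96 gives roughly 95% coverage)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = resolved / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


# 300 resolved issues out of 500 is a 60% score.
low, high = wilson_interval(300, 500)
print(f"{low:.1%} to {high:.1%}")
```

Under these assumptions a 60% score is uncertain by roughly four points either way, so two agents a couple of points apart on Verified are not clearly distinguishable from that number alone.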

Where this shows up

💻 Code agents

Frequently asked

What is the difference between SWE-bench and SWE-bench Verified?

SWE-bench has 2,294 issues, many with under-specified or broken tests. Verified is the 500-issue subset screened by humans for test reliability. Always cite Verified scores when comparing models.

Should I pick a coding agent based on SWE-bench scores?

Use it as a screening floor, not a final ranking. Anything under 40% on Verified is suspect. Above 60%, the agent can do real work. Pair the benchmark with your own pilot eval on your actual codebase: public benchmarks tell you what an agent can do at best, not how well it fits your code.
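A pilot eval does not need much machinery. Below is a minimal sketch, assuming you can express each task as an issue description plus a check that says whether the resulting patch passes your tests. Every name here (`PilotTask`, `run_pilot`, `resolve_rate`, `toy_agent`) is hypothetical, and the toy agent and checks are stand-ins so the example runs. In practice the check would apply the patch and run your test suite.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class PilotTask:
    """One issue from your own backlog (hypothetical shape)."""
    task_id: str
    issue_text: str
    check: Callable[[str], bool]  # returns True if the patch passes your tests


def run_pilot(agent: Callable[[str], str], tasks: list[PilotTask]) -> dict[str, bool]:
    """Run the agent on each task and record whether its patch passed the check."""
    outcomes: dict[str, bool] = {}
    for task in tasks:
        try:
            patch = agent(task.issue_text)
            outcomes[task.task_id] = task.check(patch)
        except Exception:
            # A crash counts as a failure, not as a skipped task.
            outcomes[task.task_id] = False
    return outcomes


def resolve_rate(outcomes: dict[str, bool]) -> float:
    """Fraction of tasks resolved; 0.0 when there are no tasks."""
    return sum(outcomes.values()) / len(outcomes) if outcomes else 0.0


# Stand-in agent and checks so the sketch runs; replace both with real ones.
def toy_agent(issue_text: str) -> str:
    return "fix: " + issue_text


tasks = [
    PilotTask("T-1", "handle empty config", lambda patch: "empty config" in patch),
    PilotTask("T-2", "retry on timeout", lambda patch: False),
]
outcomes = run_pilot(toy_agent, tasks)
print(outcomes, resolve_rate(outcomes))
```

Counting crashes as failures keeps the pilot number comparable with the benchmark, where an agent that produces no patch also scores zero for that issue.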
