SWE-bench
A benchmark from Princeton that tests coding agents on real GitHub issues: given the bug report and the repo, the agent must produce a patch that passes the project's tests.
SWE-bench is the most relevant coding-agent benchmark in 2026. Unlike HumanEval, which tests single-function generation, SWE-bench measures the full end-to-end task: read a GitHub issue, navigate the repo, write a patch, and pass the project's tests. That mirrors what coding agents actually do in production.
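The pass/fail rule behind a SWE-bench score can be sketched in a few lines. This is a minimal illustration, not the official harness: the `fail_to_pass` and `pass_to_pass` fields mirror the test lists published with each task in the public dataset, while `Instance`, `is_resolved`, `resolve_rate`, and the sample data are hypothetical names made up for this example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """One benchmark task (hypothetical container for this sketch)."""
    instance_id: str
    fail_to_pass: frozenset[str]  # tests the patch must turn from failing to passing
    pass_to_pass: frozenset[str]  # tests that passed before and must keep passing


def is_resolved(instance: Instance, passed_tests: set[str]) -> bool:
    """An instance counts as resolved only if every required test passes."""
    required = instance.fail_to_pass | instance.pass_to_pass
    return required <= passed_tests


def resolve_rate(instances: list[Instance], results: dict[str, set[str]]) -> float:
    """Fraction of instances resolved; a missing result counts as a failure."""
    if not instances:
        return 0.0
    resolved = sum(
        is_resolved(inst, results.get(inst.instance_id, set()))
        for inst in instances
    )
    return resolved / len(instances)


# Made-up sample data: two tasks, one patched correctly, one with a regression.
instances = [
    Instance("demo-1", frozenset({"test_fix"}), frozenset({"test_old"})),
    Instance("demo-2", frozenset({"test_bug"}), frozenset({"test_a", "test_b"})),
]
results = {
    "demo-1": {"test_fix", "test_old"},
    "demo-2": {"test_bug", "test_a"},  # test_b regressed, so not resolved
}

rate = resolve_rate(instances, results)
print(f"{rate:.0%}")  # prints "50%"
```

The all-or-nothing rule is the point: a patch that fixes the reported bug but breaks a previously passing test earns no credit.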
Two variants matter: the original SWE-bench (2,294 issues of mixed quality) and SWE-bench Verified (500 issues hand-screened for reliable tests). Always quote Verified scores; scores on the full set include many issues with broken or under-specified tests.
Leaderboard in early 2026: Devin, Claude with extended thinking, and Manus all sit at 60–75% on SWE-bench Verified. Cursor, Cline, and Codex CLI sit at 45–60%. The numbers improve every quarter; expect 80%+ frontier scores by end of 2026.
Frequently asked
What is the difference between SWE-bench and SWE-bench Verified?
SWE-bench has 2,294 issues, many with under-specified or broken tests. Verified is the 500-issue subset screened by humans for test reliability. Always cite Verified scores when comparing models.
Should I pick a coding agent based on SWE-bench scores?
Use it as a minimum bar, not a final verdict. Anything under 40% on Verified is suspect; above 60%, the agent can do real work. Pair the benchmark with your own pilot eval on your actual codebase: public benchmarks show general capability, not fit with your code.
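The rule of thumb above could be written down as a small screening helper. This is only a sketch: the 40% and 60% cut-offs come from the answer above, while the function name `triage`, the labels, and the treatment of the 40–60% band as "borderline" are assumptions made for illustration.

```python
def triage(verified_pct: float) -> str:
    """Bucket an agent by its SWE-bench Verified score (in percent).

    Thresholds follow the rule of thumb in the text; the "borderline"
    label for the middle band is an assumption of this sketch.
    """
    if not 0.0 <= verified_pct <= 100.0:
        raise ValueError("score must be a percentage between 0 and 100")
    if verified_pct < 40.0:
        return "suspect"      # below the minimum bar
    if verified_pct > 60.0:
        return "shortlist"    # worth a pilot eval on your own codebase
    return "borderline"       # decide from the pilot eval alone


for score in (35.0, 52.0, 72.0):
    print(score, triage(score))
```

Whatever the bucket, the label only decides which agents reach the pilot; the pilot on your own repository decides which one you adopt.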