WebArena
A benchmark of realistic web-task scenarios (e-commerce, social, content management) in which agents are scored on completing multi-step user goals in a real browser.
WebArena (CMU, 2023) hosts realistic web apps, including a Reddit clone, an e-commerce site, a GitLab instance, and an admin panel, and gives agents natural-language tasks (for example, finding a product in the store and ordering it). Success requires actually navigating, filling forms, and verifying outcomes.
It is one of the more honest benchmarks for browser-use agents because tasks fail unless the end state is correct. Hallucinating "I placed the order" doesn't score; the order record must actually exist.
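The end-state scoring described above can be illustrated with a minimal sketch. This is not WebArena's actual evaluator: `SiteState`, `Episode`, `score`, and the `("order", "sku-123")` record are hypothetical names invented here, and a set of records stands in for whatever the real web app stores. The point is only that the score depends on what the site contains afterwards, not on what the agent reports.

```python
from dataclasses import dataclass, field


@dataclass
class SiteState:
    """Hypothetical stand-in for a web app's backend state after an episode."""
    records: set[tuple[str, str]] = field(default_factory=set)  # (kind, key) pairs


@dataclass
class Episode:
    agent_claim: str   # what the agent says it did
    state: SiteState   # what the site actually contains afterwards


def score(episode: Episode, required: tuple[str, str]) -> int:
    """Return 1 only if the required record exists; the agent's claim is ignored."""
    return int(required in episode.state.records)


def success_rate(scores: list[int]) -> float:
    """Fraction of episodes that ended in the correct state."""
    return sum(scores) / len(scores) if scores else 0.0


# Assumed example task: an order for a given item must exist at the end.
required = ("order", "sku-123")
honest = Episode("I placed the order.", SiteState({("order", "sku-123")}))
hallucinated = Episode("I placed the order.", SiteState())  # same claim, no record

scores = [score(honest, required), score(hallucinated, required)]
print(scores, success_rate(scores))  # [1, 0] 0.5
```

Both episodes make the same claim, but only the one whose state contains the record scores, which is why confident but wrong agents do poorly under this kind of evaluation.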
In 2026, WebArena and VisualWebArena (the screenshot-based variant) are standard citations for any vendor claiming browser-agent capability. Top models cluster in the 20–40% success-rate range — still well below human baselines, which is part of the point.
Frequently asked
Why do agents score so low on WebArena?
Because real websites are messy. The benchmark surfaces the gap between "the agent demoed great in a constrained environment" and "the agent works on the live web." Scores climb steadily but slowly.