📊Evaluationalso: web arena, webarena benchmark, visualwebarena

WebArena

A benchmark of realistic web-task scenarios (e-commerce, social, content management) where agents are scored on completing real multi-step user goals through a real browser.

WebArena (CMU, 2023) hosts realistic web apps — a Reddit clone, an e-commerce site, a GitLab instance, an admin panel — and gives agents natural-language tasks ("find the cheapest 4-star hotel in Pittsburgh and book it"). Success requires actually navigating, filling forms, and verifying outcomes.

It is one of the more honest benchmarks for browser-use agents because tasks fail unless the end state is correct. Hallucinating "I booked the hotel" doesn't score; the booking record must actually exist.

In 2026, WebArena and VisualWebArena (the screenshot-based variant) are standard citations for any vendor claiming browser-agent capability. Top models cluster in the 20–40% success-rate range — still well below human baselines, which is part of the point.

Frequently asked

Why do agents score so low on WebArena?+

Because real websites are messy. The benchmark surfaces the gap between "the agent demoed great in a constrained environment" and "the agent works on the live web." Scores climb steadily but slowly.

Frequently asked

Related terms