Key Takeaways

  • ClawBench is a new benchmark testing AI agents on 153 tasks across 144 live, real-world production websites
  • Tasks include booking appointments, completing purchases, and submitting job applications
  • Claude Sonnet 4.6 achieved the best score at 33.3% — meaning it failed on roughly 2 in 3 tasks
  • Unlike previous benchmarks run in sandboxes, ClawBench operates on live websites and intercepts only the final submission request to prevent real-world side effects
  • Each run captures five layers of data: session replay, screenshots, HTTP traffic, agent reasoning traces, and browser actions

The Facts

Researchers from UBC and the Vector Institute have released ClawBench, a new evaluation framework that tests AI agents on tasks that would look completely ordinary to any internet user — but have proven surprisingly difficult for today's frontier AI models.

The benchmark covers 153 distinct tasks spread across 144 live production websites in 15 categories. These are not synthetic exercises: the tasks include completing purchases on actual retail sites, booking appointments on healthcare scheduling platforms, and submitting job applications through career portals. The evaluation runs on fully live websites rather than static replicas or sandboxes, and it intercepts only the final submission request so that testing causes no real-world side effects.
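As a rough illustration of what intercepting only the final submission could look like, the sketch below lets every request through except the state-changing one aimed at a task's final submit endpoint. The Request type, the should_block function, and the shop.example URLs are hypothetical names chosen for this example; they are not taken from the ClawBench code.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class Request:
    """Minimal stand-in for an outgoing browser request (hypothetical type)."""
    method: str
    url: str


def should_block(request: Request, final_submit_path: str) -> bool:
    """Return True only for a write request aimed at the final submit path."""
    is_write = request.method.upper() in {"POST", "PUT", "PATCH"}
    return is_write and urlparse(request.url).path == final_submit_path


# Browsing and intermediate form steps pass through; only the final order is held back.
requests = [
    Request("GET", "https://shop.example/cart"),
    Request("POST", "https://shop.example/cart/add"),
    Request("POST", "https://shop.example/checkout/submit"),
]
blocked = [r.url for r in requests if should_block(r, "/checkout/submit")]
print(blocked)
```

The design choice in this sketch is to match on both the HTTP method and the path, so that simply viewing the checkout page is never blocked.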

The best-performing model on ClawBench is Claude Sonnet 4.6, which achieved a completion rate of 33.3%. Put plainly, the leading frontier model fails roughly two in every three attempts on a benchmark built to test exactly the kind of autonomous web tasks often described as AI's near-term capability horizon. The full methodology is available in the ClawBench paper on arXiv.
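To make the headline number concrete, here is the arithmetic behind it, assuming the 33.3% rate is computed over all 153 tasks in a single pass (the article does not state how runs are aggregated).

```python
TOTAL_TASKS = 153        # task count reported for ClawBench
COMPLETION_RATE = 0.333  # 33.3% for the best-performing model

# Assumption: one attempt per task, rate taken over the full task set.
completed = round(TOTAL_TASKS * COMPLETION_RATE)
failed = TOTAL_TASKS - completed
failure_share = failed / TOTAL_TASKS

print(completed, failed, round(failure_share, 3))
```

Under that assumption, the rate corresponds to about 51 completed tasks and 102 failures, which is where "two in three" comes from.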


Technical Deep-Dive

What makes ClawBench methodologically valuable is its measurement depth. Each run captures five distinct layers of behavioural data: a full session replay of the browser interaction, screenshots at each decision point, complete HTTP traffic logs, the agent's internal reasoning traces, and a timestamped log of every browser action taken.
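One way to picture the five layers is as a single record per run. The sketch below is a hypothetical data structure, not the benchmark's actual schema: the class names, field names, and file paths are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class BrowserAction:
    timestamp: float  # seconds since the run started
    kind: str         # e.g. "click", "type", "navigate"
    target: str       # selector or URL the action applied to


@dataclass
class RunRecord:
    """Hypothetical container for the five layers captured per run."""
    task_id: str
    session_replay_path: str = ""
    screenshot_paths: list[str] = field(default_factory=list)
    http_log: list[dict] = field(default_factory=list)
    reasoning_trace: list[str] = field(default_factory=list)
    actions: list[BrowserAction] = field(default_factory=list)

    def missing_layers(self) -> list[str]:
        """Names of layers with no data, useful as a completeness check."""
        layers = {
            "session_replay": self.session_replay_path,
            "screenshots": self.screenshot_paths,
            "http_traffic": self.http_log,
            "reasoning_trace": self.reasoning_trace,
            "browser_actions": self.actions,
        }
        return [name for name, value in layers.items() if not value]


record = RunRecord(task_id="example-task", session_replay_path="replay.json")
record.actions.append(BrowserAction(0.0, "navigate", "https://shop.example/"))
print(record.missing_layers())  # ['screenshots', 'http_traffic', 'reasoning_trace']
```

Keeping all five layers on one record makes it straightforward to line up a browser action with the reasoning step and HTTP request that surround it.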

This multi-layer capture enables researchers to distinguish between different failure modes: was the agent confused by the page layout? Did it misinterpret the task description? Did it encounter an unexpected CAPTCHA or authentication step? Did it navigate correctly but fail on the final form submission? Each failure type points to a different area for improvement.
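The failure modes listed above can be treated as labels and tallied across runs. The sketch below assumes a reviewer has already assigned one label per failed run; the enum names are hypothetical, and the sample labels are made-up placeholders, not ClawBench results.

```python
from collections import Counter
from enum import Enum


class FailureMode(Enum):
    LAYOUT_CONFUSION = "confused by the page layout"
    TASK_MISREAD = "misinterpreted the task description"
    AUTH_OR_CAPTCHA = "hit an unexpected CAPTCHA or authentication step"
    FINAL_SUBMISSION = "failed on the final form submission"


def tally(labels: list[FailureMode]) -> Counter:
    """Count how often each failure mode was assigned."""
    return Counter(labels)


# Placeholder labels for illustration only.
sample = [
    FailureMode.AUTH_OR_CAPTCHA,
    FailureMode.LAYOUT_CONFUSION,
    FailureMode.AUTH_OR_CAPTCHA,
    FailureMode.FINAL_SUBMISSION,
]
counts = tally(sample)
most_common_mode, occurrences = counts.most_common(1)[0]
print(most_common_mode.value, occurrences)
```

A tally like this is what turns individual run logs into a ranked list of areas to improve.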

The 33.3% completion rate for Claude Sonnet 4.6 reflects the gap between LLM capability on text-based reasoning benchmarks, where frontier models now frequently score above human expert level, and the multi-step, visually grounded, dynamically rendered web environment that autonomous agents must navigate in practice. Handling cookie consent banners, dynamic JavaScript rendering, multi-step authentication flows, and session state management remains genuinely challenging for current agent systems.


The ASEAN Perspective

For ASEAN businesses evaluating AI automation for operational workflows, ClawBench data provides a useful calibration. The gap between marketing claims about AI agents and what current systems reliably accomplish on real websites is significant.

Agents are likely to perform particularly poorly on tasks that involve navigating government portals, such as Singapore's CorpPass, Malaysia's MyGov, and Indonesia's BPJS systems, because of multi-factor authentication requirements, non-standard UI patterns, and dynamic form validation. Businesses considering deploying AI agents for government procurement filing, permit renewal, or regulatory submission should treat current agent capabilities as assistive rather than autonomous.

The benchmark does not suggest that autonomous web agents are useless; it suggests that realistic expectations and human-in-the-loop design are essential for the next 12 to 24 months of deployment.


RECATOOLS Verdict

ClawBench is the kind of benchmark that the AI industry needed. The gap between what LLMs can do in controlled text environments and what they can accomplish on real production websites is substantial — and quantifying it precisely is the prerequisite to closing it.

The 33.3% score for the leading model should be read not as a failure but as a starting point. The same evaluation three years from now will likely tell a very different story. For now, it tells builders to design AI-assisted workflows with human review checkpoints at task completion rather than assuming autonomous end-to-end execution.
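A human review checkpoint at task completion can be as simple as a gate between "the agent has prepared the submission" and "the submission is sent". The sketch below is a minimal illustration; PreparedSubmission, run_with_checkpoint, and the callbacks are hypothetical names, and a real workflow would replace the lambdas with a review UI and an actual submit call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PreparedSubmission:
    description: str  # human-readable summary shown to the reviewer
    payload: dict     # data the agent intends to submit


def run_with_checkpoint(
    prepared: PreparedSubmission,
    approve: Callable[[PreparedSubmission], bool],
    submit: Callable[[dict], None],
) -> str:
    """Submit only after a human reviewer approves the prepared work."""
    if not approve(prepared):
        return "rejected"
    submit(prepared.payload)
    return "submitted"


sent = []
prepared = PreparedSubmission("Renew permit (example)", {"permit_id": "EXAMPLE-1"})

# The agent does the navigation and form filling; the human makes the final call.
status = run_with_checkpoint(prepared, approve=lambda p: True, submit=sent.append)
print(status, sent)
```

Placing the gate at the final step keeps the agent useful for the tedious parts of the task while leaving the irreversible action to a person.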


Frequently Asked Questions