The vision breakthrough that started it all
For years, AI could process text but couldn't see interfaces. Language models could summarize a document or answer a question, but they couldn't look at a login page and figure out where to type a password. That changed across late 2023 and early 2024, when models like GPT-4V and Claude 3 launched with native multimodal vision: the ability to interpret images alongside text.
This wasn't just an incremental improvement. It meant an AI could look at a screenshot of any web portal, understand the layout, identify buttons and form fields, and reason about what action to take next. The gap between 'reading about a screen' and 'navigating a screen' closed overnight.
From research demos to real products
In October 2024, Anthropic released Computer Use as a feature of Claude 3.5 Sonnet — the first major deployment of a model that could navigate computers the way a human would. Not through APIs or pre-coded integrations, but by looking at what's on screen and deciding what to do.
Three months later, in January 2025, OpenAI launched Operator, powered by their Computer-Using Agent (CUA) model. CUA combines GPT-4o's vision with reinforcement learning, operating through a loop of taking screenshots, reasoning through next steps, and performing actions like clicking, scrolling, and typing. It achieved 87% success on the WebVoyager benchmark and 38.1% on OSWorld for full computer use tasks.
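That screenshot-reason-act loop is the core architecture shared by all of these agents. A minimal sketch in Python makes the control flow concrete — note that every function and name here is an illustrative stub, not OpenAI's or Anthropic's actual API:

```python
# Minimal sketch of a computer-use agent loop (all names are hypothetical
# stubs): capture the screen, ask the model for an action, perform it, repeat.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str              # "click", "type", "scroll", or "done"
    target: tuple = (0, 0) # pixel coordinates for the action
    text: str = ""         # text payload for "type" actions

def take_screenshot(state):
    # Stand-in for a real screen capture; here the "screenshot" is just
    # a snapshot of a simulated page state.
    return dict(state)

def decide_next_action(screenshot, goal):
    # Stand-in for the vision model. A real agent sends the screenshot and
    # goal to a multimodal model and parses the action it returns.
    if not screenshot["logged_in"]:
        return Action("type", target=(120, 240), text="user@example.com")
    return Action("done")

def run_agent(goal, state, max_steps=10):
    history = []
    for _ in range(max_steps):
        shot = take_screenshot(state)
        action = decide_next_action(shot, goal)
        history.append(action.kind)
        if action.kind == "done":
            break
        if action.kind == "type":
            state["logged_in"] = True  # simulate the action's effect
    return history

print(run_agent("log in to the portal", {"logged_in": False}))
# -> ['type', 'done']
```

The `max_steps` cap matters in practice: because the model re-observes the screen on every iteration, a production loop needs a budget to stop an agent that isn't making progress.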
By mid-2025, Google rolled out Project Mariner, powered by Gemini 2.5 Pro, running directly inside Chrome. All three major AI companies had shipped production browser agents within twelve months of each other.
How the benchmarks work — and what they actually measure
The AI research community measures computer-use capability through standardized benchmarks, and understanding them helps you evaluate what's real versus what's marketing.
WebVoyager tests end-to-end web tasks across 15 real websites — booking flights, searching for products, navigating government portals. It measures whether an agent can complete a multi-step task from start to finish on a live website. Current state of the art: Surfer-H at 92.2% success, with Browser Use at 89.1% and OpenAI's CUA at 87%.
OSWorld goes further — it tests full desktop computer use across Ubuntu, Windows, and macOS, including web apps, desktop apps, and file management. This is much harder. The best agents score around 38%, compared to humans at 72.4%. There's still a massive gap here.
ScreenSpot tests something more fundamental — can the model even find the right UI element? Given a screenshot and an instruction like 'click the submit button,' can it locate the correct pixel coordinates? Recent progress with multi-scale training on 4 million examples pushed grounding accuracy from ~5% to ~27% on ScreenSpot-Pro.
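The grounding check itself is simple to state, even if the model capability is hard: a prediction counts only if the model's pixel coordinates land inside the target element's bounding box. A sketch of that scoring rule (illustrative, not the benchmark's actual harness):

```python
# Sketch of ScreenSpot-style grounding scoring (illustrative): the model
# predicts (x, y) pixel coordinates, and the prediction is correct only if
# that point falls inside the target element's bounding box.

def is_grounded(pred_xy, target_box):
    """target_box is (left, top, right, bottom) in screenshot pixels."""
    x, y = pred_xy
    left, top, right, bottom = target_box
    return left <= x <= right and top <= y <= bottom

def grounding_accuracy(predictions, targets):
    hits = sum(is_grounded(p, t) for p, t in zip(predictions, targets))
    return hits / len(targets)

# Hypothetical example: a 'Submit' button occupying a 120x40 pixel box.
submit_button = (800, 560, 920, 600)
print(is_grounded((850, 580), submit_button))  # inside the button -> True
print(is_grounded((400, 300), submit_button))  # elsewhere on screen -> False
```

This is why grounding is "more fundamental" than task completion: an agent that can't reliably pass this point-in-box test will misclick long before it can finish a multi-step booking flow.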
The takeaway: browser-based web tasks are approaching human performance. Full desktop computer use is still early. That's why browser-based automation is where this technology is production-ready today.
The open-source acceleration
What's remarkable about the last year is how fast open-source browser agents have caught up to — and in some cases surpassed — proprietary offerings. Browser Use, a fully open-source web agent, hit 89.1% on WebVoyager, outperforming OpenAI's Operator. Coasty-ai's open-computer-use project achieved 82% on OSWorld.
This matters because it means the underlying capability — AI that can navigate web portals reliably — is becoming commodity infrastructure. The differentiation shifts from 'can the model click the right button' to 'can you make this reliable, secure, and auditable enough for production use in regulated industries.' That's a very different problem, and it's where the engineering challenge lives today.
What this means for your automation
If your team spends time logging into portals, navigating forms, downloading reports, and entering data across browser-based systems, the AI to automate that work exists today, and it's improving at a pace that would have been hard to imagine two years ago.
The practical implication: automation is no longer limited to systems that have APIs. If a human can do it in a browser, computer-use AI can learn to do it too — and it can adapt when the portal changes, handle authentication, and log every step for compliance.
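The compliance-logging piece is straightforward to layer on top of any agent. One common pattern, sketched below with hypothetical names, is to wrap every browser action so it emits an audit record before returning:

```python
# Hedged sketch: wrapping each agent action in an audit record, the kind of
# step-by-step trail compliance teams want from browser automation.
# All function names here are illustrative, not any vendor's real API.
import time

def audited(action_fn):
    """Decorator that appends an audit record for every action taken."""
    log = []
    def wrapper(name, **kwargs):
        result = action_fn(name, **kwargs)
        log.append({
            "ts": time.time(),   # when the action ran
            "action": name,      # what the agent did
            "args": kwargs,      # how it did it (secrets pre-redacted)
            "ok": result,        # whether the driver reported success
        })
        return result
    wrapper.log = log
    return wrapper

@audited
def perform(name, **kwargs):
    # Stand-in for a real browser-driver call (e.g. a click or keystroke).
    return True

perform("click", selector="#login")
perform("type", selector="#password", value="[redacted]")
print(perform.log[0]["action"])  # click
```

Redacting sensitive values before they reach the log, as the `"[redacted]"` placeholder suggests, is the part regulated teams tend to get wrong first.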
We're still early in the full-desktop story. But for browser-based work — which is where most portal automation, data entry, and back-office operations happen — the technology has crossed the threshold from research curiosity to production tool.