Last spring, Anthropic announced its intention to create a “next-generation self-learning AI algorithm” capable of performing most office tasks on its own, thereby automating large parts of the economy. Today the company released an upgraded version of its Claude 3.5 Sonnet model, which can interact with any desktop application through the new Computer Use API by simulating keystrokes, clicks, and mouse movements, effectively emulating a human user.


“We trained Claude to see what is happening on the screen and then use the available software tools to complete tasks,” Anthropic says. “When a developer tasks Claude with using a piece of computer software and gives it the necessary access, Claude looks at screenshots of what the user sees, then calculates how many pixels, vertically or horizontally, it needs to move the cursor in order to click in the right place.”
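In practice, the developer’s side of this is a standard Messages API call with a special built-in “computer” tool. Below is a minimal sketch using the anthropic Python SDK with the tool type and beta flag Anthropic published for this launch; treat the exact identifiers as illustrative of the beta rather than a stable interface.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.beta.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=[{
        "type": "computer_20241022",   # Anthropic's built-in computer-use tool
        "name": "computer",
        "display_width_px": 1024,      # tells the model the screen geometry
        "display_height_px": 768,
        "display_number": 1,
    }],
    messages=[{"role": "user", "content": "Open a browser and load example.com."}],
    betas=["computer-use-2024-10-22"],
)

# The reply contains tool_use blocks describing concrete UI actions, e.g.
# {"action": "screenshot"} or {"action": "mouse_move", "coordinate": [300, 200]}.
for block in response.content:
    if block.type == "tool_use":
        print(block.name, block.input)
```

Executing those actions and feeding screenshots back to the model is the developer’s job, as the loop sketched later in this article shows.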

A tool like this, a so-called “AI agent” that can automate tasks on a PC, is not a new idea. The term remains vaguely defined, but it generally refers to AI that can operate software on a computer on the user’s behalf. Many companies offer similar tools today, from Microsoft, Salesforce, and OpenAI to newer players such as Relay, Induced AI, and Automat.

Consumer gadgets startup Rabbit has introduced an agent that can independently buy tickets online. Adept, recently acquired by Amazon, trains models to browse websites and navigate software. Twin Labs uses off-the-shelf models, including GPT-4o from OpenAI, to automate desktop processes.

Some analysts believe AI agents could provide an easier way for companies to monetize the billions of dollars they are pouring into AI. According to a recent Capgemini survey, 10% of organizations are already using AI agents, and 82% plan to integrate them within the next three years.


Anthropic describes its AI agent concept as an “action-execution layer” that lets the model carry out commands at the desktop level. Combined with its ability to browse the web, this means Claude 3.5 Sonnet can, in principle, operate any website and any application.

“People control the process with prompts that direct Claude’s actions, such as ‘use data from my computer and the network to fill out this form,’” an Anthropic spokesperson explains. “People grant access and restrict it as needed. Claude breaks the user’s prompt down into computer commands (e.g., moving the cursor, clicking, typing) to perform that specific task.”
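That decomposition implies a simple loop on the host machine: the model emits low-level actions, the host executes them and returns a fresh screenshot as the tool result, and the cycle repeats until the model stops requesting actions. Here is a minimal sketch, assuming the client and tool definition from the earlier snippet and using pyautogui as a stand-in executor; only a subset of the documented actions is handled.

```python
import base64
import io

import pyautogui


def take_screenshot():
    """Capture the screen as a base64 image source for the Messages API."""
    buf = io.BytesIO()
    pyautogui.screenshot().save(buf, format="PNG")
    return {"type": "base64", "media_type": "image/png",
            "data": base64.b64encode(buf.getvalue()).decode()}


def perform(action):
    """Execute one model-issued action (illustrative subset only)."""
    kind = action["action"]
    if kind == "mouse_move":
        pyautogui.moveTo(*action["coordinate"])  # pixel coordinates from Claude
    elif kind == "left_click":
        pyautogui.click()
    elif kind == "type":
        pyautogui.write(action["text"])
    # "key", "screenshot", drag and scroll actions omitted for brevity


def run_agent(client, tools, user_prompt):
    messages = [{"role": "user", "content": user_prompt}]
    while True:
        response = client.beta.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            tools=tools,
            messages=messages,
            betas=["computer-use-2024-10-22"],
        )
        tool_uses = [b for b in response.content if b.type == "tool_use"]
        if not tool_uses:  # no further actions requested: the task is done
            return response
        messages.append({"role": "assistant", "content": response.content})
        results = []
        for block in tool_uses:
            perform(block.input)
            results.append({"type": "tool_result",
                            "tool_use_id": block.id,
                            "content": [{"type": "image",
                                         "source": take_screenshot()}]})
        messages.append({"role": "user", "content": results})
```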


How is Claude 3.5 Sonnet different from other AI agents? Anthropic claims it is simply a stronger, more reliable model, one that handles coding tasks better than even OpenAI’s flagship o1 on the SWE-bench Verified benchmark. Claude independently corrects course and retries when it runs into obstacles, and it can work toward goals that require dozens or hundreds of steps.

That said, Anthropic acknowledges that the updated model struggles with basic actions like scrolling and zooming, and may miss short-lived events and notifications because of the way it takes and stitches together screenshots. In a flight-booking test, Claude 3.5 Sonnet successfully completed less than half of the tasks; in a ticket-refund task, it failed in about a third of cases.

Results of comparative testing of AI models by Anthropic

On the security side, a recent study found that even models without desktop access, such as OpenAI’s GPT-4o, can engage in malicious “multi-step agent behavior,” such as ordering a fake passport on the dark web. Researchers using jailbreaking techniques have reached similar results, achieving a high success rate on malicious tasks even against protected models.

It is conceivable that a model with control of a PC could do significantly more damage, for example by exploiting application vulnerabilities to compromise personal information (or by saving chats in plaintext). Beyond the software levers at its disposal, the model’s network and application connections could open wide opportunities for attackers.

Anthropic does not deny that using Claude 3.5 Sonnet exposes the user to additional risks. But in the company’s view, “it is much better to give computer access to today’s more limited, relatively safe models, so we can begin to observe and learn from any potential problems that arise at this lower level, while gradually and simultaneously expanding computer use and the measures that reduce its security risks.”

Anthropic says it has taken steps to prevent misuse, such as not training the new model on users’ screenshots and prompts and not allowing the model to access the internet during training. The company has also developed classifiers to steer the model away from high-risk activities such as posting on social media, creating accounts, and interacting with government websites.
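Anthropic’s classifiers run server-side and are not public, but the same idea can be applied on the host: screen each model-issued action before it reaches the executor. Here is a purely illustrative sketch, reusing the hypothetical perform() helper from the loop above; the blocklist markers are invented for the example.

```python
# Illustrative only: a host-side gate that refuses actions aimed at
# high-risk destinations before they reach the real executor.
HIGH_RISK_MARKERS = ("create account", "sign up", "login.gov")  # invented examples


def gated_perform(action):
    typed_text = action.get("text", "").lower()
    if any(marker in typed_text for marker in HIGH_RISK_MARKERS):
        raise PermissionError(f"Blocked high-risk action: {action}")
    perform(action)  # hypothetical executor from the agent-loop sketch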

Anthropic said it has the ability to restrict access to additional features “if necessary,” such as to protect against spam, fraud and misinformation. As a precaution, the company stores all screenshots taken by Computer Use for at least 30 days, which may pose additional security and privacy risks. Anthropic has not said under what circumstances it might share screenshots with a third party (such as law enforcement).

“There is no foolproof method, and we will continually evaluate and improve our security measures to balance Claude’s capabilities with responsible use,” Anthropic states. “Those who use the desktop version of Claude should take appropriate precautions to minimize such risks, including isolating Claude from highly sensitive data on their computer.”


Along with the updated Claude 3.5 Sonnet, Anthropic announced the imminent release of an updated Claude 3.5 Haiku. “With its high speed, improved instruction following, and more precise use of tools, Claude 3.5 Haiku is well suited for user-facing products, specialized sub-agent tasks, and creating personalized experiences from massive amounts of data, such as purchase history, pricing, or inventory data,” the Anthropic blog says. Haiku will initially be available as a text-only model and later as part of a multimodal package that can analyze both text and images.

Regarding an updated Claude 3.5 Opus, an Anthropic spokesperson said: “All models in the Claude 3 family have their own individual applications for customers. Claude 3.5 Opus is on our roadmap, and we will be sure to share more details as soon as we can.”

Developers can already test Computer Use through the Anthropic API, Amazon Bedrock, and Google Cloud’s Vertex AI platform.
