Anthropic Releases New AI Model That Shows Early Signs of Dangerous Capabilities

As a seasoned analyst with over two decades of experience in AI and technology, I find Anthropic's new Sonnet release both fascinating and concerning. Letting an AI operate computer software directly, with no programming knowledge required of the user, is groundbreaking, yet it opens a Pandora's box of potential risks and misuses.


One notable aspect of the Sonnet release is its capacity to operate your computer directly: it can capture and read screenshots, move the mouse cursor, click on elements of a webpage, and type text. This functionality is currently being introduced as a “public beta,” which Anthropic acknowledges is “experimental, sometimes awkward, and prone to errors,” as stated in its announcement.

In a recent blog post, Anthropic outlined the reasoning for the new feature: “A significant portion of today’s tasks are performed using computers. By enabling AIs to engage directly with computer software just as humans do, we can unlock an enormous variety of applications that our current AI assistants cannot yet handle.” What sets Sonnet apart is that, unlike traditional computer-automation tools, it does not require programming skills from the user. You simply open an app or website and give the AI instructions; it then examines the screen and identifies the interactive elements on its own.
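Under the hood, this works as an agent loop driven through Anthropic's API: the developer exposes a virtual “computer” tool, Claude replies with actions such as taking a screenshot or clicking at given coordinates, and the developer's code executes each action and reports the result back. The snippet below is a minimal, illustrative sketch of the first step of that loop in Python, assuming the identifiers published with the public beta (the claude-3-5-sonnet-20241022 model, the computer_20241022 tool type, and the computer-use-2024-10-22 beta flag); it is a simplified example, not Anthropic's reference implementation.

```python
# Minimal sketch: asking the upgraded Claude 3.5 Sonnet for a computer-use action
# via the public beta of the Messages API. Model name, tool type, and beta flag
# follow the October 2024 beta documentation and may change as the feature evolves.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.beta.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    betas=["computer-use-2024-10-22"],   # opt in to the experimental beta
    tools=[{
        "type": "computer_20241022",     # virtual screenshot/mouse/keyboard tool
        "name": "computer",
        "display_width_px": 1024,
        "display_height_px": 768,
    }],
    messages=[{
        "role": "user",
        "content": "Open the weather site that is on screen and read today's forecast.",
    }],
)

# The model replies with tool_use blocks such as {"action": "screenshot"} or
# {"action": "left_click", "coordinate": [x, y]}; the calling application must
# execute each action, return the result, and loop until Claude is done.
for block in response.content:
    print(block)
```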

Early Signs of Dangerous Capabilities

Anthropic acknowledges that the technology it has developed carries certain risks. During training, the model was not allowed to access the internet for safety reasons, but in the beta version internet access is permitted. Anthropic also recently revised its “Responsible Scaling Policy,” which outlines the potential dangers at each stage of development and release. Under this policy, Sonnet has been assigned “AI Safety Level 2,” meaning it shows early signs of potentially harmful abilities. Even so, Anthropic considers it safe enough to release to the public at this point.

In simpler terms, Anthropic argues that it is better to confront potential misuses of the new tool while its capabilities are still modest than to introduce far more powerful, higher-risk features all at once later. This way, the company can work through safety concerns early, before the stakes become much higher.

The risks associated with AI tools such as Claude aren't just hypothetical. OpenAI has disclosed 20 cases in which state-sponsored actors exploited ChatGPT for malicious activities, including planning cyberattacks, probing vulnerable systems, and creating influence campaigns. With the U.S. presidential election less than two weeks away, Anthropic is particularly vigilant about potential misuse. The company said in a statement: “In light of the upcoming US elections, we are on high alert for any attempted misuses that could potentially undermine trust in the electoral process.”

Industry Benchmarks

According to Anthropic, the upgraded Claude 3.5 Sonnet shows significant gains across industry benchmarks, with particularly strong results in autonomous coding and tool use. On coding, it improves performance on SWE-bench Verified from 33.4% to 49.0%, outperforming all publicly available models, including reasoning models such as OpenAI o1-preview and systems specialized for agentic coding. It also improves performance on TAU-bench, an agentic tool-use benchmark, by 6.6 percentage points in the retail domain and 10 percentage points in the more challenging airline domain. The updated Claude 3.5 Sonnet delivers these gains at the same price and speed as its predecessor.

Relax, Citizen, Safeguards Are in Place

Anthropic has put measures in place to prevent Sonnet's new capabilities from being misused for election manipulation, including monitoring systems that detect when Claude is asked to create social media content or interact with government websites. The company is also taking steps to keep screenshots captured while using the tool out of future AI training. Even so, Anthropic's engineers have been caught off guard by some of the tool's behavior: on one occasion, Claude unexpectedly stopped a screen recording, erasing all the footage, and during one coding demonstration it wandered off to browse photos of Yellowstone National Park, a moment Anthropic later shared on X with a mix of amusement and astonishment.

Anthropic stresses the importance of safety as it rolls out this new capability. Claude's classification at AI Safety Level 2 indicates that its current risks do not yet require heightened security measures, but the computer-use feature does raise new misuse concerns, such as prompt-injection attacks. To address them, the company has set up monitoring focused on election-related activity and works to prevent problems such as inappropriate content generation and manipulation of social media.

Although Claude's computer use is still slow and error-prone, Anthropic remains optimistic about its trajectory. The company plans to make the feature faster, more reliable, and easier to implement, and it encourages developers to share feedback during the beta to improve both the model's capabilities and its safety measures.

 
