Anthropic is testing AI’s capacity for sabotage

The Claude developers are eyeing potential misuse in current AI models.
By Chance Townsend
Anthropic text surrounded by asterisk logos
Credit: Mashable Composite; Anthropic / Claude

As the hype around generative AI continues to build, the need for robust safety regulations is only becoming more clear.

Now Anthropic, the company behind Claude AI, is looking at how its models could deceive or sabotage users. The company just dropped a paper laying out its approach.

Anthropic’s latest research, titled "Sabotage Evaluations for Frontier Models," comes from its Alignment Science team and is driven by the company's Responsible Scaling Policy.

The goal is to gauge just how capable AI might be at misleading users or even "subverting the systems we put in place to oversee them." The study focuses on four specific tactics: Human Decision Sabotage, Code Sabotage, Sandbagging, and Undermining Oversight.

Think of users who push ChatGPT to the limit, trying to coax it into generating inappropriate content or graphic images. These tests are all about ensuring that the AI can’t be tricked into breaking its own rules.

In the paper, Anthropic says its objective is to be ready for the possibility that AI could evolve into something with dangerous capabilities. So the company put its Claude 3 Opus and Claude 3.5 Sonnet models through a series of tests designed to evaluate and enhance its safety protocols.

The Human Decision Sabotage test examined how AI could potentially manipulate human decision-making. The second test, Code Sabotage, analyzed whether AI could subtly introduce bugs into codebases. In that test, stronger AI models actually led to stronger defenses against these kinds of vulnerabilities.

The remaining tests — Sandbagging and Undermining Oversight — explored whether the AI could conceal its true capabilities or bypass safety mechanisms embedded within the system.

For now, Anthropic’s research concludes that current AI models pose a low risk, at least in terms of these malicious capabilities.

"Minimal mitigations are currently sufficient to address sabotage risks," the team writes, but "more realistic evaluations and stronger mitigations seem likely to be necessary soon as capabilities improve."

Translation: watch out, world.
