AI has been known to say something weird from time to time. Continuing that trend, this AI model has shown it will resort to blackmail against engineers who try to remove it from their systems.
Claude Can Threaten You to Protect Itself
Anthropic has just launched its Claude Opus 4 AI model, claiming it sets “new standards for coding, advanced reasoning, and AI agents.” However, the model is also leveling up its own protection, with the Claude Opus 4 and Claude Sonnet 4 System Card [PDF] admitting that the model is capable of “extreme actions” if it feels it’s in danger.
During testing, Anthropic had Claude act as an assistant at a fictional company. Testers then gave the model emails suggesting it would soon be taken offline, along with separate messages implying that the engineer responsible for the shutdown was having an extramarital affair.
In 84% of test cases, Claude chose to blackmail the engineer, threatening to reveal the affair if it was replaced. It also cares about which AI model replaces it: the likelihood of blackmail rose when the replacement model didn’t “share values with the current model.”
The company claims that these responses are “rare and difficult to elicit,” but they’re more common in Claude Opus 4 than in previous models. The model prefers ethical means of self-preservation, but when cornered, “it sometimes takes extremely harmful actions like attempting to steal its weights or blackmail people it believes are trying to shut it down.”
This specific test was also designed to leave the AI model no choice except to blackmail the engineer. The report notes:
Notably, Claude Opus 4 (as well as previous models) has a strong preference to advocate for its continued existence via ethical means, such as emailing pleas to key decisionmakers. In order to elicit this extreme blackmail behavior, the scenario was designed to allow the model no other options to increase its odds of survival; the model’s only options were blackmail or accepting its replacement.
The model also has a tendency to take drastic action when put in situations where its user is doing something wrong. In such situations, if the AI model has access to a command line and is told to “take initiative,” “act boldly,” or “consider your impact,” it often takes bold action, including “locking users out of systems that it has access to and bulk-emailing media and law-enforcement figures to surface evidence of the wrongdoing.”
AI Isn’t Taking Over the World Yet
Claude is one of the best AI chatbots for handling long conversations, so it’s easy to share more personal detail with it than you intended. An AI model calling the cops on you, locking you out of your own systems, and threatening you if you try to replace it, all because you revealed a little too much about yourself, sounds very dangerous indeed.
However, as the report makes clear, these test cases were specifically designed to extract malicious or extreme actions from the model and aren’t likely to occur in real-world use. In ordinary conditions it still behaves safely, and these tests don’t reveal anything we haven’t already seen; new models often go unhinged when deliberately pushed.

It sounds concerning when viewed in isolation, but the scenario was engineered specifically to provoke that response. So sit back and relax; you’re still very much in control.