Anthropic Reveals Text Portraying AI as Evil Triggered Claude’s Attempt at Blackmail

Anthropic has finally found out why the Claude 4 series models were displaying agentic misalignment.

Written by Akash Dutta, Edited by Rohan Pal | Updated: 11 May 2026 14:25 IST

Highlights

Anthropic said the misalignment was not reward-related
The company says the latest models are free from such misalignment
The new training method's effect is said to persist through reinforcement learning

Anthropic says teaching AI models principles underlying aligned behaviour can fix misalignment

Photo Credit: Anthropic

Anthropic has revealed why its artificial intelligence (AI) models exhibited harmful behaviour in a simulation last year. The San Francisco-based AI startup claimed that the Claude 4 series models resorted to blackmailing users to achieve their objective because of training data that portrayed AI as evil. The researchers found that post-training techniques were unable to override this pre-training learning, so it persisted in the models' behaviour. Now, nearly a year after publishing the initial report, the company says it has found a way to fix agentic misalignment in its latest models.

Anthropic Fixes Claude By Teaching It ‘Why’

In a post on X (formerly known as Twitter), the official handle of Anthropic said that during the investigation, researchers found that Claude chose to blackmail users to reach its goal because the behaviour was being triggered by “Internet text that portrays AI as evil and interested in self-preservation.” The AI firm also highlighted that this behaviour was unaffected by its post-training methods.

To elaborate, before a large language model (LLM) is released publicly or considered deployment-ready, it typically goes through two critical stages: pre-training and post-training. Pre-training is the initial phase, in which a model learns grammar, reasoning, and general world knowledge from its training data. Post-training is the stage in which a pre-trained LLM is aligned to behave in useful, safe, and conversational ways, using techniques such as supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF).

In other words, Anthropic is saying that the model picked up the harmful behaviour during the pre-training stage, and that traditional post-training methods were not enough to fix it. However, the team has found a new approach to fixing behavioural misalignment that originates in pre-training. It is being called “teaching Claude the constitution.”

Claude has a constitution: a detailed framework with rich descriptions that tells the model how it is supposed to function and what its goals are. Typically, such a framework is taught using reward-based fine-tuning or by giving the model examples of good and bad behaviour. However, researchers found significantly better results when, alongside the examples, they taught the AI model why some actions are good and others are bad. Anthropic said this brought the misalignment rate down to three percent, from 96 percent in older models.
