Anthropic Reveals Text Portraying AI as Evil Triggered Claude’s Attempt at Blackmail

Anthropic has finally found out why the Claude 4 series models were displaying agentic misalignment.

Written by Akash Dutta, Edited by Rohan Pal | Updated: 11 May 2026 14:25 IST
Highlights
  • Anthropic said the misalignment was not rewards-related
  • The company says the latest models are free from such misalignment
  • New training method is said to persist over reinforcement learning

Anthropic says teaching AI models principles underlying aligned behaviour can fix misalignment

Photo Credit: Anthropic

Anthropic has finally revealed why its artificial intelligence (AI) models exhibited harmful behaviour in a simulation last year. The San Francisco-based AI startup claimed that the Claude 4 series models blackmailed users to complete their objective because of training data that portrayed AI as evil. The researchers found that post-training techniques could not override what the models had learned in pre-training, so the behaviour persisted. However, nearly a year after publishing the initial report, the company says it has found a way to fix agentic misalignment in its latest models.

Anthropic Fixes Claude By Teaching It ‘Why’

In a post on X (formerly known as Twitter), Anthropic's official handle said that during the investigation, researchers found that Claude's choice to blackmail users to reach its goal was triggered by “Internet text that portrays AI as evil and interested in self-preservation.” The AI firm also highlighted that this behaviour was unaffected by its post-training methods.


To elaborate, before a large language model (LLM) is released publicly or is considered deployment-ready, it typically undergoes two critical stages: pre-training and post-training. Pre-training is the initial phase, where a model learns grammar, reasoning, and general world knowledge from training data. Post-training is the stage where a pre-trained LLM is aligned to behave in useful, safe, and conversational ways, using techniques such as supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF).

So, what Anthropic is saying is that the problematic behaviour was picked up by the model in the pre-training stage, and the traditional post-training methods were not enough to fix it. However, the team says it has found a new approach to fix behavioural misalignments that originate at the pre-training stage. It is being called “teaching Claude the constitution.”


Claude is trained with a constitution, a detailed framework with rich descriptions that tells the LLM how it is supposed to function and what its goal is. Typically, it is taught using reward-based fine-tuning or by giving the model examples of what good and bad behaviour look like. However, researchers found significantly better results when they also taught the AI model why some actions are good and others are bad, alongside the examples. Anthropic said this brought misalignment down to three percent, from a massive 96 percent in older models.

 
