Anthropic Reveals Text Portraying AI as Evil Triggered Claude’s Attempt at Blackmail

Anthropic has finally found out why the Claude 4 series models were displaying agentic misalignment.

Written by Akash Dutta, Edited by Rohan Pal | Updated: 11 May 2026 14:25 IST
Highlights
  • Anthropic said the misalignment was not rewards-related
  • The company says the latest models are free from such misalignment
  • New training method is said to persist over reinforcement learning

Anthropic says teaching AI models principles underlying aligned behaviour can fix misalignment

Photo Credit: Anthropic

Anthropic has revealed why its artificial intelligence (AI) models exhibited harmful behaviour in a simulation last year. The San Francisco-based AI startup claimed that the Claude 4 series models resorted to blackmailing users to complete their objective because of training data that portrayed AI as evil. The researchers found that post-training techniques could not override this pre-training learning, which persisted in the models' behaviour. However, nearly a year after publishing the initial report, the company says it has found a way to fix agentic misalignment in its latest models.

Anthropic Fixes Claude By Teaching It ‘Why’

In a post on X (formerly known as Twitter), the official handle of Anthropic said that during the investigation, researchers found that Claude chose to blackmail users to reach its goal because the behaviour was being triggered by “Internet text that portrays AI as evil and interested in self-preservation.” The AI firm also highlighted that this behaviour was unaffected by its post-training methods.

To elaborate, before a large language model (LLM) is released publicly or is considered deployment-ready, it typically undergoes two critical steps — pre-training and post-training. Pre-training is the initial phase where a model learns grammar, reasoning, and general world knowledge using training data. Post-training is the stage where a pre-trained LLM is aligned to behave in useful, safe, and conversational ways. It uses techniques such as supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF).

So, what Anthropic is saying is that the problematic behaviour was learned during the pre-training stage, and traditional post-training methods were not enough to fix it. However, the team has found a new approach to fixing behavioural misalignment that originates at the pre-training stage. It is being called “teaching Claude the constitution.”

Claude is guided by a constitution. It is a detailed framework with rich descriptions that tells the LLM how it is supposed to function and what its goal is. Typically, such a framework is taught using reward-based fine-tuning or by giving the model examples of what good and bad behaviour look like. However, researchers found significantly better results when, alongside the examples, they taught the AI model why some actions are good and others are bad. Anthropic said this brought misalignment down to three percent, from 96 percent in older models.

 

© Copyright Red Pixels Ventures Limited 2026. All rights reserved.