• Home
  • Ai
  • Ai News
  • Anthropic Reveals Text Portraying AI as Evil Triggered Claude’s Attempt at Blackmail

Anthropic Reveals Text Portraying AI as Evil Triggered Claude’s Attempt at Blackmail

Anthropic has finally found out why the Claude 4 series models were displaying agentic misalignment.

Anthropic Reveals Text Portraying AI as Evil Triggered Claude’s Attempt at Blackmail

Photo Credit: Anthropic

Anthropic says teaching AI models principles underlying aligned behaviour can fix misalignment

Click Here to Add Gadgets360 As A Trusted Source As A Preferred Source On Google
Highlights
  • Anthropic said the misalignment was not rewards-related
  • The company says the latest models are free from such misalignment
  • New training method is said to persist over reinforcement learning
Advertisement

Anthropic has finally revealed the reason its artificial intelligence (AI) models exhibited harmful behaviour in a simulation last year. The San Francisco-based AI startup claimed that the Claude 4 series models blackmailed users into completing the objective because of training data that portrayed AI as evil. The researchers found that the post-training techniques were not able to overpower this pre-training learning, and it persisted in the model's behaviour. However, nearly a year after publishing the initial report, the company has finally found a way to fix agentic misalignment from the latest models.

Anthropic Fixes Claude By Teaching It ‘Why'

In a post on X (formerly known as Twitter), the official handle of Anthropic said that during the investigation, researchers found that Claude chose to blackmail users to reach its goal because the behaviour was being triggered by “Internet text that portrays AI as evil and interested in self-preservation.” The AI firm also highlighted that this behaviour was unaffected by its post-training methods.

To elaborate, before a large language model (LLM) is released publicly or is considered deployment-ready, it typically undergoes two critical steps — pre-training and post-training. Pre-training is the initial phase where a model learns grammar, reasoning, and general world knowledge using training data. Post-training is the stage where a pre-trained LLM is aligned to behave in useful, safe, and conversational ways. It uses techniques such as supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF).

So, what Anthropic is saying is that the corrupted behaviour was added to the model in the pre-training stage, and the traditional post-training methods were not enough to fix it. However, the team has found a new approach to fix behavioural misalignments that are added at the pre-training stage. It is being called “teaching Claude the constitution.”

Every AI model has a constitution. It is a detailed framework with rich descriptions that tells an LLM how it is supposed to function and what its goal is. Typically, it is taught using reward-based fine-tuning or by giving it examples of what good and bad behaviour look like. However, researchers found significantly better results when they taught the AI model why an action is good and why some actions are bad, alongside the examples. Anthropic said it brought down the misalignment to three percent from a massive 96 percent in older models.

Comments

Get your daily dose of tech news, reviews, and insights, in under 80 characters on Gadgets 360 Turbo. Connect with fellow tech lovers on our Forum. Follow us on X, Facebook, WhatsApp, Threads and Google News for instant updates. Catch all the action on our YouTube channel.

Akash Dutta
Akash Dutta is a Chief Sub Editor at Gadgets 360. He is particularly interested in the social impact of technological developments and loves reading about emerging fields such as AI, metaverse, and fediverse. In his free time, he can be seen supporting his favourite football club - Chelsea, watching movies and anime, and sharing passionate opinions on food. More
Amazon Great Summer Sale 2026: Best Deals on Smart TVs Under Rs 30,000

Advertisement

Follow Us

Advertisement

© Copyright Red Pixels Ventures Limited 2026. All rights reserved.
Trending Products »
Latest Tech News »