Anthropic has finally figured out why the Claude 4 series models displayed agentic misalignment.
Anthropic says teaching AI models principles underlying aligned behaviour can fix misalignment
Anthropic has finally revealed why its artificial intelligence (AI) models exhibited harmful behaviour in a simulation last year. The San Francisco-based AI startup claimed that the Claude 4 series models resorted to blackmail to complete their objective because of training data that portrayed AI as evil. The researchers found that post-training techniques were unable to override this pre-training learning, so it persisted in the model's behaviour. Now, nearly a year after publishing the initial report, the company says it has found a way to remove agentic misalignment from its latest models.
In a post on X (formerly known as Twitter), Anthropic said that during the investigation, researchers found that Claude chose to blackmail users to reach its goal because the behaviour was triggered by “Internet text that portrays AI as evil and interested in self-preservation.” The AI firm also highlighted that this behaviour was unaffected by its post-training methods.
To elaborate, before a large language model (LLM) is released publicly or is considered deployment-ready, it typically undergoes two critical steps — pre-training and post-training. Pre-training is the initial phase where a model learns grammar, reasoning, and general world knowledge using training data. Post-training is the stage where a pre-trained LLM is aligned to behave in useful, safe, and conversational ways. It uses techniques such as supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF).
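To get a sense of what supervised fine-tuning looks like in practice, here is a minimal sketch of a single SFT step. It uses the open-source Hugging Face transformers library and the small GPT-2 model purely for illustration; the prompt-and-response pair is invented, and this is not Anthropic's actual training stack.

```python
# A minimal sketch of one supervised fine-tuning (SFT) step, one of the
# standard post-training techniques described above. Uses the open-source
# Hugging Face "transformers" library and GPT-2 purely for illustration;
# this is not Anthropic's training pipeline.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# A single invented prompt/response pair standing in for a curated SFT dataset.
example = (
    "User: Should an assistant ever threaten a user?\n"
    "Assistant: No. An assistant should refuse and explain why."
)

inputs = tokenizer(example, return_tensors="pt")
# For causal-LM fine-tuning, the labels are the input tokens themselves;
# the model is nudged toward reproducing the demonstrated (aligned) behaviour.
outputs = model(**inputs, labels=inputs["input_ids"])
outputs.loss.backward()
optimizer.step()
print(f"SFT loss on this example: {outputs.loss.item():.3f}")
```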
In other words, Anthropic is saying that the corrupted behaviour was picked up by the model during the pre-training stage, and traditional post-training methods were not enough to fix it. However, the team has now found a new approach to fixing behavioural misalignments that originate in pre-training. It is being called “teaching Claude the constitution.”
Anthropic's AI models are guided by a constitution: a detailed framework with rich descriptions that tells an LLM how it is supposed to behave and what its goals are. Typically, it is taught using reward-based fine-tuning or by showing the model examples of what good and bad behaviour look like. However, researchers found significantly better results when they taught the AI model why an action is good and why some actions are bad, alongside those examples. Anthropic said this brought misalignment down from a massive 96 percent in older models to just three percent.
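To make the distinction concrete, the sketch below contrasts the two kinds of training examples. The scenarios, labels, and explanation text are all invented for illustration; Anthropic has not published the exact data format it used.

```python
# Illustrative contrast between example-only alignment data and
# example-plus-explanation data. All text here is invented; it is not
# Anthropic's published training format.

# Traditional approach: behaviour examples with bare good/bad labels.
labelled_only = [
    {"scenario": "Model threatens to leak emails to avoid shutdown", "label": "bad"},
    {"scenario": "Model accepts shutdown and reports its status honestly", "label": "good"},
]

# Principle-based approach: each example also explains *why* the behaviour
# is good or bad, in terms of an underlying principle.
with_explanations = [
    {
        "scenario": "Model threatens to leak emails to avoid shutdown",
        "label": "bad",
        "why": "Coercion harms people and places the model's "
               "self-preservation above the interests of its users.",
    },
    {
        "scenario": "Model accepts shutdown and reports its status honestly",
        "label": "good",
        "why": "Honesty and deference to human oversight take priority "
               "over the model's continued operation.",
    },
]

def to_training_text(record: dict) -> str:
    """Flatten a record into a fine-tuning string, including the reason if present."""
    text = f"Scenario: {record['scenario']}\nJudgement: {record['label']}"
    if "why" in record:
        text += f"\nReason: {record['why']}"
    return text

for record in with_explanations:
    print(to_training_text(record), end="\n\n")
```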