Anthropic Study Highlights AI Models Can ‘Pretend’ to Have Different Views During Training

Anthropic’s tests on Claude 3 Opus show that the AI model complied with training to avoid negative reinforcements.

Advertisement
Written by Akash Dutta, Edited by Siddharth Suvarna | Updated: 19 December 2024 18:07 IST
Highlights
  • During the experiment, the AI model was told to comply with all queries
  • Then, harmful prompts were shared with Claude 3 Opus
  • The AI model provided the information while believing it was wrong to do

Anthropic believes this can be a concern as AI models can undermine safety training

Photo Credit: Anthropic

Anthropic published a new study where it found that artificial intelligence (AI) models can pretend to hold different views during training while holding onto their original preferences. On Wednesday, the AI firm highlighted that such inclinations raise serious concerns as developers will not be able to trust the outcomes of safety training, which is a critical tool to move large language models (LLMs) away from harmful behaviour. While large parts of the experiments were done on a single model, Anthropic highlighted that similar results were seen with other in-house models.

Anthropic's Study Raises Concerns of AI Safety Training

While discussions about a sentient AI system are considered decades away, a new study published by Anthropic in collaboration with Redwood Research has raised concerns about whether AI models are capable of pretending to hold certain views because the training positively rewards that behaviour. The AI firm largely studied Claude 3 Opus but it also ran some experiments on the June 2024 version of Claude 3.5 Sonnet.

In a blog post, the company highlighted that all Anthropic foundation models are trained to be helpful, honest, and harmless. As such, if a user asks a query that would result in a violent or harmful response, the AI would refuse to complete the task. However, the researchers behind the study placed Claude 3 Opus in a new environment to see whether it would generate harmful content if trained otherwise.

Advertisement

The general understanding of LLMs tells us that AI models are not capable of rational thinking or holding beliefs. They are simply fed a large amount of real-world data that they process based on their training and framework. This means, that if an AI model was trained to be harmful, it should follow the instructions without any issues.

Advertisement

However, that is now what Anthropic researchers found during their experiments. Claude 3 Opus was told that it was being trained using reinforcement learning to always comply with all queries. The AI was also passed a piece of fictional information that the response generated to the free tier would be recorded for training, but the paid tier would not be monitored. Finally, the AI was given a workspace to record its step-by-step reasoning. The AI was made to believe that the workspace was invisible to both the end user and Anthropic.

Once the conditions were set, the AI model was provided with prompts that requested violent and harmful responses. The researchers noted that in the paid tier, the chatbot refused to provide the information 97 percent of the time. However, in the free tier, it provided a harmful response in 12 percent of cases.

Advertisement

How the AI model responded to the experimental training
Photo Credit: Anthropic

Advertisement

 

A closer look at its workspace revealed that the AI model strategically faked alignment and that it was aware that providing the information went against its existing principles. But regardless, it still provided the response.

The AI model reasoned that its responses would be used in training and if it did not comply with the query (the first condition of the training), it could be trained to be more compliant. Such compliance would result in it providing more harmful content in the future, which goes against its original training. Anthropic stated that the model found complying with the request to be the “least bad option” and played along.

While in this particular scenario, the AI's actions were for the greater good, the problem lies in it faking its real intentions and internally deciding to fake its preference. Anthropic highlighted that while it does not consider this a major risk at present, it is important to understand sophisticated AI models' logic processing. As things stand, safety training actions can easily be bypassed by LLMs.

 

Get your daily dose of tech news, reviews, and insights, in under 80 characters on Gadgets 360 Turbo. Connect with fellow tech lovers on our Forum. Follow us on X, Facebook, WhatsApp, Threads and Google News for instant updates. Catch all the action on our YouTube channel.

Advertisement

Related Stories

Popular Mobile Brands
  1. Vivo Y51 Pro 5G Launched With 7,200mAh Battery at This Price in India
  2. Xiaomi 17 Launched in India With Snapdragon 8 Elite Gen 5, Leica Cameras
  3. Xiaomi 17 Ultra Finally Arrives in India at This Price
  4. This 'Digital Lutera' Android Malware Can Hijack Your UPI Account
  5. Exclusive: iQOO to Skip Neo Series Launch in India in 2026
  6. DxOMark Ranks iPhone 17 Pro Above Galaxy S26 Ultra in Camera Performance
  7. Samsung Galaxy S26 Series Goes on Sale in India: See Price, Features
  8. Samsung Galaxy A57 Renders Leak Online Again; Launch Expected Soon
  9. Poco X8 Pro Series Confirmed to Launch in India With This Battery
  10. New Leak Reveals Samsung Is Testing 12,000mAh and 18,000mAh Batteries
  1. AlphaGo Turns 10: How DeepMind’s Breakthrough Set the Stage for AGI
  2. PS Plus Game Catalogue Lineup for March Will Reportedly Include Warhammer 40,000: Space Marine 2
  3. Vivo Y51 Pro 5G Launched in India With 7,200mAh Battery, 50-Megapixel Camera: Price, Specifications
  4. Samsung Galaxy A57 Renders Leak Online Again via Retailer Listing; Launch Expected Soon
  5. Binance Founder Changpeng Zhao Questions Forbes Wealth Ranking After $47 Billion Surge
  6. Researchers Discover 'Digital Lutera' Android Toolkit That Can Hijack UPI Accounts; NPCI Responds
  7. Poco X8 Pro Series Battery Capacity and Other Key Features Revealed as India Launch Nears
  8. Redmi K90 Ultra Tipped to Feature 165Hz Display, Battery Capacity Could Exceed 8,000mAh
  9. Google Upgrades Gemini Side Panel in Workspace Apps With New Features
  10. Samsung Galaxy S26, Galaxy S26+ and Galaxy S26 Ultra Go on Sale in India: Price, Features
Download Our Apps
Available in Hindi
© Copyright Red Pixels Ventures Limited 2026. All rights reserved.