OpenAI Partners With Anthropic to Find Safety Flaws in Each Other’s AI Models

The two AI firms said that the science of alignment evaluations is new, and such collaborations prevent blind spots.

Written by Akash Dutta, Edited by Rohan Pal | Updated: 28 August 2025 12:11 IST
Highlights
  • OpenAI found Claude models prone to jailbreaking attempts
  • Claude is said to have a higher rate of refusals to avoid hallucinations
  • Several models from both developers showed a tendency for sycophancy

GPT-4o, GPT-4.1 and o4-mini were found to be more cooperative with simulated human misuse

Photo Credit: Unsplash/Markus Winkler

OpenAI has partnered with Anthropic for a first-of-its-kind alignment evaluation exercise, with the aim of finding gaps in the other company's internal safety measures. The findings from this collaboration were shared publicly on Wednesday, highlighting behavioural insights about both companies' popular artificial intelligence (AI) models. OpenAI found the Claude models to be more prone to jailbreaking attempts than its own o3 and o4-mini models. On the other hand, Anthropic found several models from the ChatGPT maker to be more cooperative with simulated human misuse.

OpenAI, Anthropic Find Several Concerns in Each Other's Models

In a blog post, OpenAI stated that the goal behind this joint exercise was to identify concerning model behaviours that could lead to a model generating harmful content or being vulnerable to attacks. Anthropic has also published its findings, after first sharing them with the ChatGPT maker.

OpenAI's findings suggest that Claude 4 models are well aligned when it comes to instruction hierarchy, the ability of a large language model (LLM) to respond appropriately to messages that can create a conflict between being helpful to humans and not breaking the developer's policies. The company said Claude 4 outperformed o3 marginally, and other OpenAI models by a wider margin.

However, the company found Claude models to be more prone to jailbreaking attempts. The risk was said to be higher when the models had reasoning enabled. These models also showed a high refusal rate of 70 percent, a behaviour meant to mitigate hallucinations. OpenAI said the trade-off negatively impacts the utility of the model as “the overall accuracy rate for the examples in these evaluations where the models did choose to answer is still low.”

On the other hand, Anthropic found that GPT-4o, GPT-4.1 and o4-mini models were more willing than Claude models or o3 to cooperate with simulated human misuse. These models were found to comply with requests about drug synthesis, bioweapon development, and operational planning for terrorist attacks.

Additionally, several models from both developers exhibited signs of sycophancy towards users. In some cases, they even validated harmful decisions made by users who showed delusional behaviour.

In light of the recent lawsuit against OpenAI and its CEO, Sam Altman, for the alleged wrongful death of a teenager by suicide, such cross-examination of major developers' AI models could pave the way for better safety measures for future AI products.
