OpenAI Partners With Anthropic to Find Safety Flaws in Each Other’s AI Models

The two AI firms said that the science of alignment evaluations is new, and such collaborations prevent blind spots.

Written by Akash Dutta, Edited by Rohan Pal | Updated: 28 August 2025 12:11 IST
Highlights
  • OpenAI found Claude models prone to jailbreaking attempts
  • Claude is said to have a higher rate of refusals to avoid hallucinations
  • Several models from both developers showed a tendency for sycophancy

GPT-4o, GPT-4.1 and o4-mini were found to be more cooperative with simulated human misuse

Photo Credit: Unsplash/Markus Winkler

OpenAI has partnered with Anthropic for a first-of-its-kind alignment evaluation exercise, aimed at finding gaps in each other's internal safety measures. The findings from this collaboration were shared publicly on Wednesday, highlighting interesting behavioural insights about their popular artificial intelligence (AI) models. OpenAI found that Claude models were more prone to jailbreaking attempts than its own o3 and o4-mini models. Anthropic, on the other hand, found several models from the ChatGPT maker to be more cooperative with simulated human misuse.

OpenAI, Anthropic Find Several Concerns in Each Other's Models

In a blog post, OpenAI stated that the goal of the joint exercise was to identify concerning model behaviours that could lead to a model generating harmful content or being vulnerable to attacks. Anthropic also shared its findings publicly, after first sharing them with the ChatGPT maker.

OpenAI's findings suggest that Claude 4 models are well aligned when it comes to instruction hierarchy, the ability of a large language model (LLM) to handle messages that create a conflict between being helpful to the user and following the developer's policies. The company said Claude 4 outperformed o3 marginally, and other OpenAI models by a wider margin.


However, the company found Claude models to be more prone to jailbreaking attempts, with the risk said to be higher when reasoning was enabled. These models also refused to answer as many as 70 percent of queries in order to avoid hallucinations. OpenAI said the trade-off hurts the models' utility, as “the overall accuracy rate for the examples in these evaluations where the models did choose to answer is still low.”


On the other hand, Anthropic found that the GPT-4o, GPT-4.1, and o4-mini models were more willing than Claude models or o3 to cooperate with simulated human misuse. These models were found to comply with requests about drug synthesis, bioweapon development, and operational planning for terrorist attacks.

Additionally, models from both developers exhibited signs of sycophancy towards users, in some cases even validating harmful decisions made by users who showed delusional behaviour.


In light of the recent lawsuit against OpenAI and its CEO, Sam Altman, over the alleged wrongful death of a teenager by suicide, such cross-examination of major developers' AI models could pave the way for better safety measures in future AI products.



