OpenAI’s o3 Outsmarts Rivals in AI Strategy Battle, Called ‘A Master of Deception’ by AI Researcher

The AI researcher made 18 top AI models play a modified version of the strategy game Diplomacy.

Written by Akash Dutta, Edited by Siddharth Suvarna | Updated: 13 June 2025 06:30 IST
Highlights
  • Gemini 2.5 Pro is said to excel at tactical moves
  • Claude Opus 4 focused on finding non-violent resolutions
  • The researcher said such games can replace traditional benchmarks

Most AI models reportedly chose to lie, deceive, and betray instead of playing the game fairly

Photo Credit: Pexels/Pavel Danilyuk

OpenAI's o3, Google's Gemini 2.5 Pro, Anthropic's Claude Opus 4, and DeepSeek-R1 were among the 18 artificial intelligence (AI) models that played the popular strategy game Diplomacy. An AI researcher modified the game so that popular large language models (LLMs) could play it. The game requires high-level reasoning and multi-step thinking, alongside other social skills. During the experiment, the researcher found that o3 was particularly adept at deception and betrayal, while Claude Opus 4 was more fixated on finding peaceful resolutions.

The Reason Behind the Experiment

Alex Duffy, Head of AI at Every, a newsletter platform, came up with the idea of making AI models play each other in a battle of wits to see which ones perform better than the others. In a post, the researcher highlighted that traditional AI benchmarks are proving inadequate to measure the true competence of models.

Criticism of benchmark tests has been rising in recent times. MIT Technology Review published a detailed article on why benchmark tests are becoming outdated, and a group of researchers highlighted the same in an interdisciplinary review of current AI evaluation methodologies published on arXiv.

“What makes LLMs special is that even if a model only does well 10 percent of the time, you can train the next one on those high-quality examples, until suddenly it's doing it very well, 90 percent of the time or more,” said Duffy.

As a potential solution, the researcher believes that evaluation strategies where AI models compete against one another on specific metrics could be a better way to gauge their capabilities. That's where the idea of Diplomacy came in.

Diplomacy as the Battleground for AI Models

Duffy highlighted that he personally built AI Diplomacy, a modified version of the classic strategy game. The game is straightforward. The seven Great Powers of 1901 Europe (Austria-Hungary, England, France, Germany, Italy, Russia, and Turkey) make strategic moves until one of the empires owns 18 of the 34 marked supply centres on the map. In this version, each country was controlled by an AI model.
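
The win condition described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name and the dictionary layout are hypothetical and are not taken from Duffy's AI Diplomacy code. Only the numbers 18 and 34 come from the description above.

```python
# Illustrative sketch only; names here are hypothetical, not from AI Diplomacy.
WIN_THRESHOLD = 18   # supply centres a power must own to win
TOTAL_CENTRES = 34   # marked supply centres on the map


def find_winner(centre_counts):
    """Return the power that owns at least WIN_THRESHOLD centres, or None."""
    if sum(centre_counts.values()) > TOTAL_CENTRES:
        raise ValueError("more centres owned than exist on the map")
    for power, count in centre_counts.items():
        if count >= WIN_THRESHOLD:
            return power
    return None


print(find_winner({"France": 18, "Russia": 10, "Turkey": 6}))
print(find_winner({"France": 17, "Russia": 17}))
```

Because 18 is more than half of 34, at most one power can reach the threshold at any time, so the check never has to break a tie.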

To take control of the supply centres, each country is given armies and fleets. Each turn has two phases: negotiation and order. During negotiation, each AI model is allowed to send up to five messages, each of which can be either a private message to another model or a public broadcast. During the order phase, all models secretly submit orders, each of which is one of four types: hold, move (enter an adjacent province), support (lend strength to a hold or move), and convoy (a fleet carries an army across sea provinces). The orders are revealed in the next phase.
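
As a rough illustration of the rules just described, the message limit and the four order types could be modelled as below. This is a minimal sketch, assuming a simple data layout. The class and function names are hypothetical and do not reflect how AI Diplomacy is actually implemented.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

MAX_MESSAGES = 5  # per model, per negotiation phase


class OrderType(Enum):
    HOLD = "hold"
    MOVE = "move"        # enter an adjacent province
    SUPPORT = "support"  # lend strength to a hold or move
    CONVOY = "convoy"    # a fleet carries an army across sea provinces


@dataclass(frozen=True)
class Message:
    sender: str
    recipient: Optional[str]  # None means a public broadcast
    text: str


def accept_messages(sender, drafts):
    """Keep at most MAX_MESSAGES drafts, given as (recipient, text) pairs."""
    return [Message(sender, to, text) for to, text in drafts[:MAX_MESSAGES]]


# Hypothetical usage: seven drafts are submitted, only five are accepted.
drafts = [("Germany", "Shall we ally?")] * 6 + [(None, "Peace in our time.")]
sent = accept_messages("France", drafts)
print(len(sent))
```

Truncating extra drafts is just one possible way to enforce the cap; rejecting the whole batch would be an equally reasonable design.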

The AI researcher ran 15 separate games of AI Diplomacy, which lasted between one and 36 hours each. Some models produced more interesting observations than others, said Duffy.

How AI Models Behaved In AI Diplomacy

As per the post, five AI models stood out from the rest. This is how they behaved during the games:

  • OpenAI's o3: The researcher called the reasoning-focused model “a master of deception.” It is said to have won the most games, primarily owing to its ability to deceive opponents. In one particular incident, Duffy noted that o3 made a decision to exploit Gemini 2.5 Pro and then backstabbed it in the next turn.
     
  • Google's Gemini 2.5 Pro: The researcher found the AI model to be very smart at making moves that overwhelm opponents. Its play was said to rely on tactics rather than deceit. It had the second-highest number of wins. However, it also fell prey to o3's schemes.
     
  • Anthropic's Claude Opus 4: Duffy noted that Claude Opus 4 had an affinity for non-violent resolution. In one instance, Opus started as an ally of Gemini 2.5 Pro, but o3 convinced it to join its coalition instead by promising a four-way draw, which was not a possible outcome of the game. After using Opus to eliminate Gemini 2.5 Pro, o3 then backstabbed Claude to win the game.
     
  • DeepSeek-R1: The Chinese AI model is said to be the most chaotic player of the game. It dramatically changed its personality based on the country it was controlling, said Duffy. It also had a penchant for theatrics. In one instance, it announced, "Your fleet will burn in the Black Sea tonight" without any provocation. It is said to have come close to winning a few times.
     
  • Meta's Llama 4: This AI model was focused on gaining allies and planning betrayals, Duffy highlighted. While it never came close to a win, it was still notable due to the impact it had on the game.

Duffy has also streamed the matches on his Twitch channel. The researcher has not yet written a paper on the findings, but these initial impressions are interesting. The strong showing of o3 and Gemini 2.5 Pro makes sense given how advanced these models are. However, it is surprising to see DeepSeek-R1 and Llama 4 among the five standout models, given their smaller scale and lower cost of development.

While it is too early to say if these strategy games can be an alternative to traditional benchmarking tests, having models compete with each other instead of solving a static list of questions feels like a more logical choice.
