OpenAI’s o3 AI Model Falls Short of Benchmark Claims in FrontierMath Test

In December, OpenAI claimed, based on internal testing, that the o3 AI model scored 25 percent on FrontierMath.

Written by Akash Dutta, Edited by Manas Mitul | Updated: 21 April 2025 13:45 IST
Highlights
  • The public version of o3 scored 10 percent on the same benchmark
  • It is speculated that the tested o3 model had more processing power
  • The incident raises concerns about the validity of benchmark scores

OpenAI told ARC Prize that the released o3 model is different from the one tested by the organisation

Photo Credit: Unsplash/Levart_Photographer

OpenAI's o3 artificial intelligence (AI) model, which was released last week, is underperforming on a specific benchmark. Epoch AI, the company behind the FrontierMath benchmark, highlighted that the publicly available version of the o3 AI model scored 10 percent on the test, much lower than the score OpenAI claimed when it announced the model. The San Francisco-based AI firm's chief research officer, Mark Chen, had said that the model scored 25 percent on the test, setting a new record. However, the discrepancy does not mean that OpenAI lied about the metric.

OpenAI's o3 AI Model Scores 10 Percent on FrontierMath

In December 2024, OpenAI held a livestream on YouTube and other social media platforms, announcing the o3 AI model. At the time, the company highlighted the improved set of capabilities in the large language model (LLM), in particular, its improved performance in reasoning-based queries.

One of the ways the company supported the claim was by sharing the model's benchmark scores across different popular tests. One of these tests was FrontierMath, created by Epoch AI. The mathematical test is known for being challenging and tamper-proof, as more than 70 mathematicians developed it, and the problems are all new and unpublished. Notably, until December, no AI model had solved more than nine percent of the questions in a single attempt.


However, during the December announcement, Chen claimed that o3 had set a new record by scoring 25 percent on the test. External verification of the performance was not possible at the time, as the model was not publicly available. After o3 and o4-mini were launched last week, Epoch AI said in a post on X (formerly known as Twitter) that the o3 model, in fact, scored 10 percent on the test.


While a score of 10 percent still makes the AI model the highest ranking on the test, the number is less than half of the 25 percent OpenAI claimed. The post has led several AI enthusiasts to question the validity of benchmark scores.

The discrepancy does not mean that OpenAI lied about the performance of its AI model. Instead, the AI firm's unreleased model likely used more compute to reach that score. The commercial version of the model, by contrast, was likely fine-tuned to be more power efficient, and some of its performance was toned down in the process.


Separately, ARC Prize, the organisation behind the ARC-AGI benchmark, which tests an AI model's general intelligence, also posted on X about the discrepancy. The post confirmed, “The released o3 is a different model from what we tested in December 2024.” The organisation said that the compute tiers of the released o3 model are smaller than those of the version it tested. However, it did confirm that o3 was not trained on ARC-AGI data, even at the pre-training stage.

ARC Prize said that it will re-test the released o3 AI model and publish the updated results. The organisation will also re-test the o4-mini model, and label the prior scores as “preview”. It is not certain that the released version of o3 will underperform on this test as well.

 

© Copyright Red Pixels Ventures Limited 2025. All rights reserved.