Apple Claims AI Reasoning Models Suffer From ‘Accuracy Collapse’ When Solving Complex Problems

Apple researchers tested reasoning and non-reasoning models’ ability to solve mathematical puzzles.

Written by Akash Dutta, Edited by Siddharth Suvarna | Updated: 9 June 2025 12:04 IST
Highlights
  • The Tower of Hanoi was one of the puzzles solved by the models
  • Researchers gave the models three levels of complexity in tasks
  • Claude 3.7 Sonnet and DeepSeek V3/R1 were chosen for this experiment

The research concluded that the reasoning-focused AI models lack logical accuracy

Photo Credit: Reuters

Apple published a research paper on Saturday in which researchers examine the strengths and weaknesses of recently released reasoning models. Also known as large reasoning models (LRMs), these models “think” by utilising additional compute to solve complex problems. However, the paper found that even the most powerful models struggle as problem complexity rises. Researchers said that when a problem is highly complex, the models experience a total collapse and give up on the problem instead of using more compute, even though spending more compute on harder problems is what they are trained to do.

Apple Says Reasoning Models Aren't Really Reasoning Beyond a Level

In a paper titled “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity,” published on Apple's website, the researchers claim that LRMs and large language models (LLMs) without thinking capability behave differently across three regimes of complexity.

The paper describes these three regimes as low complexity tasks, medium complexity tasks, and high complexity tasks. To test how LLMs and LRMs function across a wide range of complexities, the researchers used several puzzles whose difficulty can be increased step by step. One of these puzzles was the Tower of Hanoi.

The Tower of Hanoi is a mathematical puzzle with three pegs and several disks. Disks are arranged in a decreasing order of size to create a pyramid-like shape. The objective of the puzzle is to shift the disks from the leftmost peg to the rightmost peg, while moving one disk at a time. There is a catch — at no time should a larger disk be placed on top of a smaller disk. It is not a very difficult puzzle, and it is often targeted at children between the ages of six and 15.
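The rules above can be sketched as a short recursive solver. This is a textbook illustration, not code from Apple's paper: the function name `hanoi_moves` and the peg numbering (0 for the leftmost peg, 2 for the rightmost) are assumptions made for this example. It shows that the shortest solution for n disks takes 2^n - 1 moves, which is why adding disks raises the difficulty so quickly.

```python
def hanoi_moves(n, source=0, spare=1, target=2):
    """Return the shortest list of (from_peg, to_peg) moves for n disks.

    Peg numbers are an assumption for this sketch: 0 = leftmost, 2 = rightmost.
    """
    if n == 0:
        return []
    return (
        # 1. Park the n-1 smaller disks on the spare peg.
        hanoi_moves(n - 1, source, target, spare)
        # 2. Move the largest disk straight to the target peg.
        + [(source, target)]
        # 3. Bring the n-1 smaller disks from the spare peg onto it.
        + hanoi_moves(n - 1, spare, source, target)
    )


print(hanoi_moves(2))        # [(0, 1), (0, 2), (1, 2)]
print(len(hanoi_moves(3)))   # 7, i.e. 2**3 - 1
print(len(hanoi_moves(10)))  # 1023, i.e. 2**10 - 1
```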


Mathematical puzzles solved by reasoning models
Photo Credit: Apple


Apple researchers chose two reasoning models and their non-reasoning counterparts for this experiment. The LLMs chosen were Claude 3.7 Sonnet and DeepSeek-V3, while the LRMs were Claude 3.7 Sonnet with Thinking and DeepSeek-R1. The thinking budget was capped at 64,000 tokens each. The aim of the experiment was to check not just the accuracy of the final answer, but also the logical accuracy of the steps chosen to solve the puzzle.
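One way to picture checking the steps rather than only the final answer is a small simulator that replays a proposed move list against the puzzle rules. The article does not describe how the researchers actually scored the models, so the sketch below is purely illustrative: the function name `check_moves`, the `(from_peg, to_peg)` move format, and the return values are all assumptions.

```python
def check_moves(n, moves):
    """Replay moves on an n-disk Tower of Hanoi.

    Returns (solved, first_bad_index). first_bad_index is the position of
    the first rule-breaking move, or None if every move was legal.
    Hypothetical checker for illustration, not the paper's evaluation code.
    """
    # Peg 0 starts with disks n..1, largest at the bottom (end of list = top).
    pegs = [list(range(n, 0, -1)), [], []]
    for i, (src, dst) in enumerate(moves):
        if not pegs[src]:
            return False, i  # tried to move from an empty peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False, i  # larger disk placed on a smaller one
        pegs[dst].append(pegs[src].pop())
    solved = pegs[2] == list(range(n, 0, -1))
    return solved, None


print(check_moves(2, [(0, 1), (0, 2), (1, 2)]))  # (True, None)
print(check_moves(2, [(0, 2), (0, 2)]))          # (False, 1)
```

A checker of this kind separates two failure modes: a move list that breaks a rule partway through, and one that is legal throughout but never finishes the puzzle.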


In the low complexity task, up to three disks were used, whereas for the medium complexity task, the number of disks was kept between four and 10. Finally, in the high complexity task, there were between 11 and 20 disks.
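To give a sense of scale for these three bands, the sketch below maps a disk count to the regime named in the article and prints the minimum number of moves needed, using the standard result that an n-disk Tower of Hanoi takes 2^n - 1 moves. The function name `complexity_regime` is an assumption, and the thresholds are taken from the figures reported above, not from the paper's own code.

```python
def complexity_regime(num_disks):
    """Map a disk count to the regime described in the article.

    Thresholds (1-3, 4-10, 11-20) follow the article's figures; the
    function itself is a hypothetical helper for illustration.
    """
    if num_disks < 1:
        raise ValueError("need at least one disk")
    if num_disks <= 3:
        return "low"
    if num_disks <= 10:
        return "medium"
    if num_disks <= 20:
        return "high"
    raise ValueError("more than 20 disks is outside the reported range")


for n in (3, 10, 20):
    # Minimum moves for n disks is 2**n - 1.
    print(n, complexity_regime(n), 2**n - 1)
# 3 low 7
# 10 medium 1023
# 20 high 1048575
```

The jump from 7 moves at the top of the low band to more than a million at the top of the high band shows how steeply the required solution length grows with each added disk.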

The researchers noted that LLMs and LRMs displayed equal aptitude in solving the low complexity task. When the difficulty was increased, reasoning models were able to solve the puzzle more accurately, given the extra budget of compute. However, when the tasks reached the high complexity zone, both types of models showed a complete collapse of reasoning.

The same experiment was reportedly repeated with more models and more puzzles, such as Checkers Jumping, River Crossing, and Blocks World.

Apple's research paper highlights concerns that several others in the artificial intelligence (AI) space have already expressed. While reasoning models can generalise within the distribution of their training data, whenever a problem falls outside it, the models struggle with “thinking,” and either try to take shortcuts in finding the solution or give up completely and collapse.

“Current evaluations primarily focus on established mathematical and coding benchmarks, emphasising final answer accuracy. However, this evaluation paradigm often suffers from data contamination and does not provide insights into the reasoning traces' structure and quality,” the company said in a post.

 
