Apple Study Reveals Reasoning Limits in Large Language Models

Apple quietly dropped one of the most detailed critiques of AI “reasoning” models we’ve seen so far.
Their claim: current frontier LLMs aren’t really reasoning — they’re mimicking it.
The Illusion of Thinking: Where Reasoning Models Break
Apple’s team introduced a framework for evaluating reasoning in large models using controlled puzzle environments (e.g., Tower of Hanoi, Blocks World).
This setup let them analyze not just final answers, but also how models "think" on the way to them, via the intermediate reasoning traces they produce.
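To make that setup concrete, here is a minimal sketch of what such a controlled puzzle environment can look like, using Tower of Hanoi with complexity dialed up simply by adding disks. This is our illustration under assumed conventions (the class and function names are ours), not Apple's actual evaluation harness.

```python
# Illustrative sketch only (not Apple's harness): a Tower of Hanoi environment
# whose difficulty is controlled by a single knob, the number of disks, and
# which can mechanically check every move a model proposes.

class TowerOfHanoi:
    def __init__(self, num_disks: int):
        # Three pegs; all disks start on peg 0, largest disk at the bottom.
        self.num_disks = num_disks
        self.pegs = [list(range(num_disks, 0, -1)), [], []]

    def apply_move(self, src: int, dst: int) -> bool:
        # A move is legal only if the source peg is non-empty and the moved
        # disk is smaller than the destination peg's current top disk.
        if not self.pegs[src]:
            return False
        if self.pegs[dst] and self.pegs[dst][-1] < self.pegs[src][-1]:
            return False
        self.pegs[dst].append(self.pegs[src].pop())
        return True

    def is_solved(self) -> bool:
        # Solved when every disk has been moved to the last peg.
        return len(self.pegs[2]) == self.num_disks


def replay(num_disks: int, moves: list[tuple[int, int]]) -> dict:
    """Replay a proposed move list; report success and the first illegal step, if any."""
    env = TowerOfHanoi(num_disks)
    for step, (src, dst) in enumerate(moves):
        if not env.apply_move(src, dst):
            return {"solved": False, "first_invalid_step": step}
    return {"solved": env.is_solved(), "first_invalid_step": None}
```

Because the shortest solution for n disks takes 2^n - 1 moves, each extra disk roughly doubles the required plan length, which is what makes it possible to sweep complexity cleanly while keeping the rules fixed.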
Their findings were stark:
- Accuracy collapsed entirely once task complexity passed a threshold.
- Thinking effort decreased on harder problems, even though the models had ample token budget left.
- Overthinking emerged: models kept exploring incorrect alternatives even after finding the correct solution.
- Even when the solution algorithm was provided, models failed to execute its steps reliably (see the sketch below).
This wasn't limited to a couple of models either: Claude 3.7, DeepSeek-R1, and o3-mini all exhibited similar limitations across tasks.
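For context on that last bullet, the canonical recursive recipe for Tower of Hanoi is short enough to paste into a prompt. The sketch below is our rendering of that textbook procedure, not the paper's exact prompt; the reported finding is that even with a recipe like this in hand, models struggle to unroll its steps faithfully once the disk count grows.

```python
# Textbook recursive procedure for Tower of Hanoi. "Supplying the algorithm"
# roughly means handing the model a recipe like this and asking it only to
# execute the steps; the exact prompt wording in the paper may differ.

def hanoi_moves(n: int, src: int = 0, aux: int = 1, dst: int = 2) -> list[tuple[int, int]]:
    """Return the optimal move sequence (2**n - 1 moves) for n disks."""
    if n == 0:
        return []
    return (
        hanoi_moves(n - 1, src, dst, aux)    # 1. park the n-1 smaller disks on the spare peg
        + [(src, dst)]                       # 2. move the largest disk to the target peg
        + hanoi_moves(n - 1, aux, src, dst)  # 3. restack the n-1 smaller disks on top of it
    )
```

Executing this for n disks means emitting exactly 2^n - 1 moves in order without losing track of state, and that bookkeeping, rather than discovering the strategy, is where the models reportedly break down.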
Rysysth Insights
This research forces a reframe: We don’t just need better benchmarks or more parameters — we need to understand what it actually means for an AI to reason.
Here’s what stood out to us:
- Reasoning ≠ pattern recognition: Models can appear smart even when the underlying step-by-step reasoning isn't sound.
- Token budget isn’t enough: More thinking tokens don’t guarantee better reasoning — especially past a complexity threshold.
- Three clear regimes emerged:
  - Low complexity: standard, non-thinking LLMs sometimes outperform large reasoning models (LRMs).
  - Medium complexity: reasoning models pull ahead, though only briefly.
  - High complexity: everyone collapses.
- "Overthinking" is real: Even when the solution is found early, models continue generating redundant reasoning.
- Supplying the algorithm doesn’t help: Models still fail to execute correct steps consistently.
From Rysysth’s perspective, this isn’t a sign that AI is failing — it’s a sign we’re testing the right things. By moving beyond accuracy metrics and into trace-based reasoning analysis, we’re starting to understand not just what models do — but how they do it, and where they fail.
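As a sketch of what trace-based analysis can look like in practice, the snippet below scans a model's thinking trace for move statements, replays them, and reports how far into the trace the first fully correct solution appears. The move phrasing, the regex, and the function names are assumptions for illustration, not the paper's tooling; a result well below 1.0 would point to the overthinking pattern described above.

```python
import re

# Hedged illustration of trace-based analysis (assumed trace format, not the
# paper's tooling): find the earliest point in the "thinking" trace at which a
# complete, valid Tower of Hanoi solution has already been written down.

MOVE_PATTERN = re.compile(r"move disk \d+ from peg (\d) to peg (\d)", re.IGNORECASE)

def _is_valid_solution(moves: list[tuple[int, int]], num_disks: int) -> bool:
    # Compact validator so this block stands alone: replay moves on three
    # 0-indexed pegs and reject any illegal move.
    pegs = [list(range(num_disks, 0, -1)), [], []]
    for src, dst in moves:
        if not pegs[src] or (pegs[dst] and pegs[dst][-1] < pegs[src][-1]):
            return False
        pegs[dst].append(pegs[src].pop())
    return len(pegs[2]) == num_disks

def first_solution_fraction(trace: str, num_disks: int) -> float | None:
    """Fraction of the trace consumed before the first complete valid solution (None if never)."""
    matches = list(MOVE_PATTERN.finditer(trace))
    # The trace is assumed to number pegs 1-3; convert to 0-indexed pegs.
    moves = [(int(m.group(1)) - 1, int(m.group(2)) - 1) for m in matches]
    for k in range(1, len(moves) + 1):
        if _is_valid_solution(moves[:k], num_disks):
            return matches[k - 1].end() / len(trace)
    return None
```

If this returns a small fraction while the trace runs on for thousands more tokens, that gap is exactly the redundant exploration the paper calls overthinking.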
In a field obsessed with performance metrics, Apple’s paper is a reminder:
"Thinking" in AI may still be an illusion — at least for now.
"This isn’t a sign that AI is failing — it’s a sign we’re testing the right things. By moving beyond accuracy metrics and into trace-based reasoning analysis, we’re starting to understand not just what models do — but how they do it, and where they fail."