Apple Study Reveals Reasoning Limits in Large Language Models

Apple quietly dropped one of the most detailed critiques of AI “reasoning” models we’ve seen so far.
Their claim: current frontier LLMs aren’t really reasoning — they’re mimicking it.
The Illusion of Thinking: Where Reasoning Models Break
Apple’s team introduced a framework for evaluating reasoning in large models using controlled puzzle environments (e.g., Tower of Hanoi, Blocks World).
This setup let them analyze not just final answers, but also how models "think" on the way to them, via the intermediate reasoning traces they produce.
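To make that setup concrete, here is a minimal sketch of what such a controlled puzzle environment can look like, using Tower of Hanoi with complexity dialed up simply by adding disks. This is our illustration under assumed conventions (the class and function names are ours), not Apple's actual evaluation harness.

```python
# Illustrative sketch only (not Apple's harness): a Tower of Hanoi environment
# whose difficulty is controlled by a single knob, the number of disks, and
# which can mechanically check every move a model proposes.

class TowerOfHanoi:
    def __init__(self, num_disks: int):
        # Three pegs; all disks start on peg 0, largest disk at the bottom.
        self.num_disks = num_disks
        self.pegs = [list(range(num_disks, 0, -1)), [], []]

    def apply_move(self, src: int, dst: int) -> bool:
        # A move is legal only if the source peg is non-empty and the moved
        # disk is smaller than the destination peg's current top disk.
        if not self.pegs[src]:
            return False
        if self.pegs[dst] and self.pegs[dst][-1] < self.pegs[src][-1]:
            return False
        self.pegs[dst].append(self.pegs[src].pop())
        return True

    def is_solved(self) -> bool:
        # Solved when every disk has been moved to the last peg.
        return len(self.pegs[2]) == self.num_disks


def replay(num_disks: int, moves: list[tuple[int, int]]) -> dict:
    """Replay a proposed move list; report success and the first illegal step, if any."""
    env = TowerOfHanoi(num_disks)
    for step, (src, dst) in enumerate(moves):
        if not env.apply_move(src, dst):
            return {"solved": False, "first_invalid_step": step}
    return {"solved": env.is_solved(), "first_invalid_step": None}
```

Because the shortest solution for n disks takes 2^n - 1 moves, each extra disk roughly doubles the required plan length, which is what makes it possible to sweep complexity cleanly while keeping the rules fixed.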
Their findings were stark:
- Accuracy collapsed entirely once task complexity passed a threshold.
- Thinking effort decreased on harder problems, even though the models had ample token budget left.
- Overthinking emerged: models kept exploring incorrect alternatives even after finding the correct solution.
- Even when the solution algorithm was provided, models failed to execute its steps reliably (see the sketch below).
This wasn't limited to a couple of models either: Claude 3.7, DeepSeek-R1, and o3-mini all exhibited similar limitations across tasks.
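For context on that last bullet, the canonical recursive recipe for Tower of Hanoi is short enough to paste into a prompt. The sketch below is our rendering of that textbook procedure, not the paper's exact prompt; the reported finding is that even with a recipe like this in hand, models struggle to unroll its steps faithfully once the disk count grows.

```python
# Textbook recursive procedure for Tower of Hanoi. "Supplying the algorithm"
# roughly means handing the model a recipe like this and asking it only to
# execute the steps; the exact prompt wording in the paper may differ.

def hanoi_moves(n: int, src: int = 0, aux: int = 1, dst: int = 2) -> list[tuple[int, int]]:
    """Return the optimal move sequence (2**n - 1 moves) for n disks."""
    if n == 0:
        return []
    return (
        hanoi_moves(n - 1, src, dst, aux)    # 1. park the n-1 smaller disks on the spare peg
        + [(src, dst)]                       # 2. move the largest disk to the target peg
        + hanoi_moves(n - 1, aux, src, dst)  # 3. restack the n-1 smaller disks on top of it
    )
```

Executing this for n disks means emitting exactly 2^n - 1 moves in order without losing track of state, and that bookkeeping, rather than discovering the strategy, is where the models reportedly break down.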
Rysysth Insights
This research forces a reframe: We don’t just need better benchmarks or more parameters — we need to understand what it actually means for an AI to reason.
Here’s what stood out to us:
- Reasoning ≠ pattern recognition: Models can appear smart even when the underlying step-by-step reasoning isn't sound.
- Token budget isn’t enough: More thinking tokens don’t guarantee better reasoning — especially past a complexity threshold.
- Three clear regimes emerged:
  - Low complexity: standard, non-thinking LLMs sometimes outperform large reasoning models (LRMs).
  - Medium complexity: reasoning models pull ahead, though only briefly.
  - High complexity: everyone collapses.
- "Overthinking" is real: Even when the solution is found early, models continue generating redundant reasoning.
- Supplying the algorithm doesn’t help: Models still fail to execute correct steps consistently.
From Rysysth’s perspective, this isn’t a sign that AI is failing — it’s a sign we’re testing the right things. By moving beyond accuracy metrics and into trace-based reasoning analysis, we’re starting to understand not just what models do — but how they do it, and where they fail.
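As a sketch of what trace-based analysis can look like in practice, the snippet below scans a model's thinking trace for move statements, replays them, and reports how far into the trace the first fully correct solution appears. The move phrasing, the regex, and the function names are assumptions for illustration, not the paper's tooling; a result well below 1.0 would point to the overthinking pattern described above.

```python
import re

# Hedged illustration of trace-based analysis (assumed trace format, not the
# paper's tooling): find the earliest point in the "thinking" trace at which a
# complete, valid Tower of Hanoi solution has already been written down.

MOVE_PATTERN = re.compile(r"move disk \d+ from peg (\d) to peg (\d)", re.IGNORECASE)

def _is_valid_solution(moves: list[tuple[int, int]], num_disks: int) -> bool:
    # Compact validator so this block stands alone: replay moves on three
    # 0-indexed pegs and reject any illegal move.
    pegs = [list(range(num_disks, 0, -1)), [], []]
    for src, dst in moves:
        if not pegs[src] or (pegs[dst] and pegs[dst][-1] < pegs[src][-1]):
            return False
        pegs[dst].append(pegs[src].pop())
    return len(pegs[2]) == num_disks

def first_solution_fraction(trace: str, num_disks: int) -> float | None:
    """Fraction of the trace consumed before the first complete valid solution (None if never)."""
    matches = list(MOVE_PATTERN.finditer(trace))
    # The trace is assumed to number pegs 1-3; convert to 0-indexed pegs.
    moves = [(int(m.group(1)) - 1, int(m.group(2)) - 1) for m in matches]
    for k in range(1, len(moves) + 1):
        if _is_valid_solution(moves[:k], num_disks):
            return matches[k - 1].end() / len(trace)
    return None
```

If this returns a small fraction while the trace runs on for thousands more tokens, that gap is exactly the redundant exploration the paper calls overthinking.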
In a field obsessed with performance metrics, Apple’s paper is a reminder:
"Thinking" in AI may still be an illusion — at least for now.
"This isn’t a sign that AI is failing — it’s a sign we’re testing the right things. By moving beyond accuracy metrics and into trace-based reasoning analysis, we’re starting to understand not just what models do — but how they do it, and where they fail."