Large language models (LLMs) have come to dominate natural‑language AI, and a new generation—Large Reasoning Models (LRMs)—now claims to “think” via extended chain‑of‑thought (CoT) outputs. But is this genuine reasoning or merely a high‑tech parlor trick? In their paper “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity”, Shojaee et al. (Apple) tackle this question head‑on. Here’s a detailed unpacking of their approach, results, and the implications for AI’s reasoning frontier.
Why We Need Better Reasoning Evaluations
The Rise of Chain‑of‑Thought
- Chain‑of‑Thought prompting (CoT) nudges LLMs to produce intermediate reasoning steps, boosting performance on math and logic tasks.
- New RL‑tuned “thinking” models (e.g., DeepSeek‑R1, Claude 3.7 Thinking, Gemini Flash Thinking) explicitly optimize for richer CoTs.
Shortcomings of Standard Benchmarks
- Data Contamination: Models often train on the very math problems used to evaluate them (e.g., MATH500), making “success” ambiguous.
- Final‑Answer Focus: Benchmarks score only the correctness of the answer, ignoring whether the reasoning path is valid or merely memorized.
Insight: Without understanding how a model arrives at an answer, we can’t tell if it’s truly reasoning or just pattern‑matching.

A Controlled Puzzle Testbed
To probe real reasoning, the authors turn to four deterministic puzzles where every step can be rigorously checked:
| Puzzle | Core Challenge | Complexity Control |
|---|---|---|
| Tower of Hanoi | Recursive disk moves | Number of disks (n) |
| Checker Jumping | One-dimensional color swap | Number of checkers (2n+1) |
| River Crossing | Constraint satisfaction (actor-agent pairs) | Number of pairs (n) |
| Blocks World | Stack reconfiguration | Number of blocks (n) |
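To make the "Complexity Control" column concrete: the minimum solution length grows at very different rates across these puzzles. The optimal Tower of Hanoi solution takes 2^n - 1 moves, and the classic one-dimensional checker puzzle with n checkers per color takes (n + 1)^2 - 1 moves. These are standard combinatorial facts rather than quotes from the paper; the short sketch below only illustrates how quickly compositional depth ramps up.

```python
# Minimum number of moves as a function of the size parameter n.
# Standard closed forms, shown only to illustrate how fast compositional depth grows.

def hanoi_min_moves(n: int) -> int:
    """Optimal move count for Tower of Hanoi with n disks."""
    return 2 ** n - 1

def checker_jumping_min_moves(n: int) -> int:
    """Optimal move count for the 1-D checker puzzle with n checkers per color."""
    return (n + 1) ** 2 - 1  # equivalently n**2 + 2*n

for n in (3, 5, 7, 10, 15):
    print(f"n={n:>2}  Hanoi: {hanoi_min_moves(n):>6}  Checkers: {checker_jumping_min_moves(n):>4}")
```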
Each puzzle comes with a simulator that:
- Validates every intermediate move (e.g., no larger disk on a smaller peg),
- Tracks full state transitions,
- Detects any violation instantly, so we know exactly where and why a model fails (a minimal validator sketch follows).
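As an illustration of what such a simulator enforces, here is a minimal Tower of Hanoi move validator. It is not the authors' code, just a sketch of the kind of step-by-step check involved, assuming moves are represented as (disk, from_peg, to_peg) tuples:

```python
def validate_hanoi(moves, n_disks):
    """Replay a proposed move list on an n-disk puzzle.

    Returns (True, None) if every move is legal and the goal state is reached;
    otherwise (False, index of the first failing step), where a legal but
    incomplete sequence fails at index len(moves).
    """
    # Pegs 0, 1, 2; all disks start on peg 0, largest (n_disks) at the bottom.
    pegs = [list(range(n_disks, 0, -1)), [], []]

    for i, (disk, src, dst) in enumerate(moves):
        # The moved disk must be on top of the source peg...
        if not pegs[src] or pegs[src][-1] != disk:
            return False, i
        # ...and must not land on a smaller disk.
        if pegs[dst] and pegs[dst][-1] < disk:
            return False, i
        pegs[dst].append(pegs[src].pop())

    # Goal: the full tower reassembled on peg 2.
    solved = pegs[2] == list(range(n_disks, 0, -1))
    return (True, None) if solved else (False, len(moves))


# The optimal 3-disk solution passes; an illegal first move is caught at index 0.
good = [(1, 0, 2), (2, 0, 1), (1, 2, 1), (3, 0, 2), (1, 1, 0), (2, 1, 2), (1, 0, 2)]
print(validate_hanoi(good, 3))          # (True, None)
print(validate_hanoi([(3, 0, 2)], 3))   # (False, 0) -- disk 3 is not on top
```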
Experimental Setup
Model Pairs
- Thinking vs. Non‑Thinking
- Claude 3.7 Sonnet (with and without thinking mode)
- DeepSeek‑R1 (RL‑tuned) vs. DeepSeek‑V3 (base)
- Token Budget Matched
Both variants get the same inference‑time token allowance (up to 64k), removing compute as a confound.

(Figure: comparing the matched model pairs on established benchmarks reveals inconsistent performance patterns.)
Complexity Regimes
- Low Complexity: Minimal puzzle size (e.g., 3 disks in Hanoi)
- Medium Complexity: Challenging but solvable (e.g., 7 disks)
- High Complexity: Beyond human ease, where solutions require hundreds of steps
For each setting, models generate 25 samples per puzzle instance; a sample counts as a success only if it reaches the goal state exactly (a minimal scoring sketch follows).
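A minimal sketch of how a single puzzle instance might be scored under this criterion, reusing a validator like the one sketched earlier (the harness details are assumptions, not the paper's code):

```python
def instance_accuracy(samples, validator):
    """Fraction of sampled move sequences that exactly reach the goal state.

    `samples` is a list of move lists (25 per instance in the paper's setup);
    `validator` is a puzzle-specific checker, e.g. lambda m: validate_hanoi(m, 7).
    """
    successes = sum(1 for moves in samples if validator(moves)[0])
    return successes / len(samples)
```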
Three Performance Regimes
Regime I: Low Complexity
- Surprising Winner: Non‑thinking LLMs often outperform their CoT‑enabled peers.
- Fewer tokens used.
- Higher accuracy on trivial instances.
- Why? “Overhead” of producing long thoughts can introduce noise when simple heuristics suffice.
Regime II: Medium Complexity
- Thinking Models Pull Ahead
- Their extended CoTs help explore deeper solution paths.
- Accuracy gap widens in favor of LRMs.
Regime III: High Complexity
- Universal Collapse
- Both thinking and non‑thinking models collapse to near‑zero accuracy.
- Even with plenty of tokens left, LRMs stop expanding their thought traces—a scaling limit.
The Counterintuitive Scaling Limit
Token Usage Peaks, Then Falls
As puzzle difficulty ramps up, LRMs initially consume more reasoning tokens—until a critical threshold. Beyond that:
- They reduce their thought length, despite having plenty of unused budget.
- This looks like an internal "give-up" heuristic rather than graceful scaling; a simple way to spot the turnover in measured traces is sketched below.
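The following snippet is illustrative only: the token counts are placeholders, not the paper's data, but they show the shape of the analysis one would run on real traces.

```python
# Placeholder averages of thinking tokens per puzzle size n (not real measurements).
avg_thinking_tokens = {3: 1_800, 5: 6_500, 7: 14_000, 9: 21_000, 11: 16_000, 13: 9_000}

# The counterintuitive signature: effort rises with complexity, then falls
# even though the 64k budget is nowhere near exhausted.
peak_n = max(avg_thinking_tokens, key=avg_thinking_tokens.get)
declining = [n for n in avg_thinking_tokens if n > peak_n]
print(f"Thinking effort peaks at n={peak_n}; it declines again for n in {declining}.")
```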
Implication
Models aren’t genuinely scaling their problem‑solving effort; they’re following learned heuristics that break down under deep compositional depth.
Peeking Inside the “Thoughts”
Using the puzzle simulators, the authors extract every candidate solution step embedded in the CoTs.
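A rough sketch of that kind of trace analysis, assuming moves are written in a parseable form such as "move disk 2 from peg 0 to peg 1" (the extraction format and the single-replay simplification are assumptions, not the paper's exact method):

```python
import re

# Hypothetical surface form for moves inside the chain of thought.
MOVE_RE = re.compile(r"move disk (\d+) from peg (\d+) to peg (\d+)", re.IGNORECASE)

def annotate_trace(cot_text, n_disks):
    """Replay every move mentioned in a CoT and record, for each one, whether it
    was legal at that point and how far into the trace it appeared (0.0-1.0)."""
    pegs = [list(range(n_disks, 0, -1)), [], []]
    annotated = []
    for m in MOVE_RE.finditer(cot_text):
        disk, src, dst = map(int, m.groups())
        in_range = src <= 2 and dst <= 2
        legal = (in_range and bool(pegs[src]) and pegs[src][-1] == disk
                 and (not pegs[dst] or pegs[dst][-1] > disk))
        if legal:
            pegs[dst].append(pegs[src].pop())
        annotated.append({"legal": legal, "position": m.end() / max(len(cot_text), 1)})
    return annotated
```

From such annotations one can read off, for example, how early the first legal move appears and how much of the trace is spent on illegal or redundant exploration.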
Overthinking in Easy Cases
- The earliest correct move appears within the first 10% of tokens.
- Yet the model continues exploring bad paths, cluttering its trace with errors.
Delayed Corrections in Medium Cases
- Correct moves migrate toward the latter half of the CoT.
- Indicates useful self‑reflection, but at high compute cost.
Total Failure in Hard Cases
- No correct moves at any point.
- Suggests models never discover valid sub‑solutions when compositional complexity soars.
Algorithm Execution: No Free Pass
Even when the exact algorithm for the Tower of Hanoi is provided in the prompt:
- Models still fail at roughly the same complexity threshold.
- They struggle not only with devising a solution, but even with executing a known procedure.
Takeaway: LRMs lack robust symbolic manipulation—you can’t simply feed them pseudo‑code and expect flawless execution.
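For reference, the kind of procedure being handed to the model is nothing exotic; the textbook recursive Tower of Hanoi solver fits in a few lines (this sketch is shown for context, not quoted from the paper's prompt):

```python
def solve_hanoi(n, src=0, aux=1, dst=2, moves=None):
    """Textbook recursion: emit the optimal (disk, from_peg, to_peg) move list for n disks."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    solve_hanoi(n - 1, src, dst, aux, moves)   # park the n-1 smaller disks on the spare peg
    moves.append((n, src, dst))                # move the largest remaining disk
    solve_hanoi(n - 1, aux, src, dst, moves)   # bring the smaller disks back on top
    return moves

assert len(solve_hanoi(10)) == 2 ** 10 - 1     # 1023 moves for n=10
```

Describing this procedure is easy; the finding is that faithfully executing its hundreds of steps is where the models break down.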
Surprising Puzzle‑Specific Behaviors
- Tower of Hanoi: Models can string together 100+ correct moves for n=10 before a single error.
- River Crossing: Fail as early as move 4 for just 3 actor‑agent pairs.
Hypothesis: Web‑trained LMs have seen many Hanoi solutions online, but far fewer River Crossing examples—revealing a heavy reliance on memorization.
Broader Implications
- Reasoning ≠ Longer Chains: Quantity of thought tokens doesn’t guarantee quality or correctness.
- Benchmarks Must Evolve: Controlled, contamination‑free environments are crucial for probing true reasoning.
- Architectural Rethink Needed: Future models must integrate genuine symbolic processors or algorithmic modules—not just deeper autoregression.
Open Questions & Future Directions
- Why do LRMs under‑utilize tokens under extreme difficulty? Is it a learned “early exit” policy?
- Can we hybridize LLMs with discrete planners to avoid these collapse points?
- How can evaluation metrics incorporate trace validity alongside final accuracy? (One possible shape is sketched below.)
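On that last question, here is a purely illustrative sketch of a combined score that blends final-state correctness with the fraction of intermediate moves that were legal. The formula and weighting are assumptions for discussion, not a proposal from the paper.

```python
def trace_aware_score(moves, n_disks, alpha=0.5):
    """Blend final-answer accuracy with trace validity; alpha weights the two parts."""
    solved, first_bad = validate_hanoi(moves, n_disks)  # validator from the earlier sketch
    final_score = 1.0 if solved else 0.0
    valid_prefix = len(moves) if first_bad is None else first_bad
    trace_score = valid_prefix / max(len(moves), 1)
    return alpha * final_score + (1 - alpha) * trace_score
```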
Conclusion: The Illusion Unmasked
Shojaee et al. deliver a compelling exposé: today's "thinking" models often simulate reasoning, but do not embody it. Their performance is brittle, propped up by memorized training data, and fundamentally capped by a scaling limit. To truly advance AI's reasoning capabilities, the field must move beyond flashy CoT demos and toward architectures that can both generate and verify complex, compositional thought processes.