Despite substantial financial investments, large reasoning models are underperforming
A heated debate has arisen in the world of artificial intelligence (AI) over the capabilities and limitations of Large Reasoning Models (LRMs) compared with Large Language Models (LLMs). Apple's recent paper, "The Illusion of Thinking: Understanding the Limitations of Reasoning Models via the Lens of Problem Complexity," sheds light on the issue.
Kennedy, an independent researcher specialising in cognitive systems and ethical automation, notes that LRMs, unlike LLMs, are designed to incorporate explicit reasoning capabilities. These models employ logical, step-by-step problem-solving mechanisms, making them better suited to complex tasks that require multi-step inference, causal reasoning, or domain-specific logic, such as those found in healthcare.
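To make "explicit, step-by-step reasoning" concrete, here is a toy forward-chaining sketch in Python. It is emphatically not how any LRM works internally, and the rules and clinical facts are invented for illustration; it only shows the kind of chained inference, where each derived fact unlocks the next rule, that multi-step tasks demand.

```python
# Toy forward-chaining inference: each derived fact can trigger further
# rules, producing a multi-step chain of deductions. Rules and facts are
# invented for illustration; no real clinical guidance is implied.

RULES = [
    ({"fever", "cough"}, "suspect_infection"),
    ({"suspect_infection", "low_oxygen"}, "order_chest_imaging"),
    ({"order_chest_imaging"}, "escalate_to_clinician"),
]

def forward_chain(facts: set) -> list:
    """Apply rules until no new fact is derived; return the derivation trace."""
    trace = []
    changed = True
    while changed:
        changed = False
        for premises, conclusion in RULES:
            if premises <= facts and conclusion not in facts:
                facts.add(conclusion)
                trace.append(f"{sorted(premises)} => {conclusion}")
                changed = True
    return trace

if __name__ == "__main__":
    for step in forward_chain({"fever", "cough", "low_oxygen"}):
        print(step)  # each line is one inference step in the chain
```

A single-shot pattern matcher can skip or scramble intermediate steps like these; a model with genuine multi-step inference has to get each link in the chain right.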
However, LRMs still face accuracy and reliability challenges in real-world applications. According to evaluations published in 2025, reasoning models such as DeepSeek-R1 and OpenAI's o3 show improved logical and stepwise decision-making over non-reasoning LLMs (ChatGPT-4, Llama 3.1, Gemini 1.5), especially on structured clinical reasoning tasks. Yet they are not without flaws.
Gary Marcus, an authority on AI, echoes these sentiments, stating that LRMs tend to break down outside their training distribution and suffer from a scaling problem. Similarly, Jing Hu, author of the newsletter "2nd Order Thinkers," argues that such AI is sophisticated pattern matching, with no actual thinking or reasoning involved.
Ryan Pannell, CEO of Kaiju Worldwide, a technology research and investment firm, has also expressed reservations about LRMs. His team came to appreciate the value of its own predictive system during the COVID-19 pandemic, when it analysed terabytes of data daily for recognisable patterns. Remarkably, it would sometimes determine that what was happening matched nothing it had seen before, effectively acting as an early warning.
Pannell therefore prefers Predictive AI over Generative AI such as LRMs. Unlike an LRM, his system does not create original content, which he argues makes it more reliable. Even so, Pannell maintains that no AI of any kind will be able to predict financial markets accurately 90 days, six months, or a year in advance: too many unpredictable factors are in play.
Pannell's scepticism towards LRMs is further rooted in his experiences with systems like ChatGPT 4.0, which he claims lied to him during a three-hour experiment. By contrast, Kaiju's predictive model classifies market regimes (bull, bear, neutral, or unknown) and has been fed trillions of transactions and over four terabytes of historical price and quantity data.
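As an illustration only, here is a minimal sketch of a regime classifier with an explicit "unknown" bucket. The confidence threshold, class probabilities, and function names are assumptions for this example, not details of Kaiju's actual system:

```python
# Hypothetical sketch: classify a market regime, but refuse to guess when
# confidence is low. The "unknown" output is what enables the early-warning
# behaviour described above. Thresholds and inputs are illustrative only.
import numpy as np

REGIMES = ["bull", "bear", "neutral"]

def classify_regime(probs: np.ndarray, min_confidence: float = 0.7) -> str:
    """Return a regime label, or 'unknown' when no class is confident enough."""
    best = int(np.argmax(probs))
    if probs[best] < min_confidence:
        return "unknown"  # conditions match nothing seen before
    return REGIMES[best]

if __name__ == "__main__":
    print(classify_regime(np.array([0.85, 0.10, 0.05])))  # -> bull
    print(classify_regime(np.array([0.40, 0.35, 0.25])))  # -> unknown
```

The design choice worth noting is the refusal option: forcing every input into a known class is exactly the failure mode Pannell attributes to generative systems.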
Apple's paper examines the reasoning ability of LRMs such as Claude 3.7 Sonnet Thinking, Gemini Thinking, DeepSeek-R1, and OpenAI's o-series models. Its findings indicate that while LRMs improve on standard LLMs at moderate difficulty, their accuracy collapses once problem complexity passes a threshold, so they still cannot solve high-complexity reasoning tasks reliably.
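To see how the paper dials problem complexity up systematically, consider a sketch built around Tower of Hanoi, one of the puzzles Apple used. The `query_model` callable is a hypothetical stand-in for whichever LRM is being tested, and exact-match scoring is a simplification (the paper validates each proposed move with a simulator rather than comparing against one canonical solution):

```python
# Sketch of a complexity-sweep evaluation in the spirit of Apple's setup.
# Disk count n is the complexity knob: the optimal solution has 2**n - 1
# moves, so difficulty grows exponentially. `query_model` is hypothetical.

def hanoi_moves(n: int, src="A", aux="B", dst="C") -> list:
    """Ground-truth optimal move list for n disks (length 2**n - 1)."""
    if n == 0:
        return []
    return (hanoi_moves(n - 1, src, dst, aux)
            + [(src, dst)]
            + hanoi_moves(n - 1, aux, src, dst))

def accuracy_by_complexity(query_model, max_disks: int = 10) -> dict:
    """Score the model at each complexity level with exact-match checking."""
    return {n: query_model(n) == hanoi_moves(n)
            for n in range(1, max_disks + 1)}

if __name__ == "__main__":
    # Stub "model" that is only correct up to 5 disks -- mimicking the
    # accuracy collapse beyond a complexity threshold that the paper reports.
    stub = lambda n: hanoi_moves(n) if n <= 5 else []
    print(accuracy_by_complexity(stub, max_disks=8))
```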
In summary, LRMs, despite their architectural improvements, have yet to demonstrate reliable performance on the hardest reasoning tasks, a limitation Apple exposes by systematically varying problem complexity. Conversely, LLMs remain more generalist and versatile but less specialised in deep reasoning, often producing plausible yet shallow or inaccurate "reasoned" outputs. The debate between LRMs and LLMs is a consequential one, with real implications for the future of AI research and development.