The Shocking Truth Behind AI “Reasoning”: Apple’s New Paper Destroys the Illusion!
++ Google Teases Gemini 2.5 Pro, UK Partners with NVIDIA, Sarvam's Multilingual AI, Runway's Film Festival, Indian Banks Urged to Adopt AI, and More
Today's highlights:
You are reading the 101st edition of The Responsible AI Digest by SoRAI (School of Responsible AI). Subscribe today for regular updates!
At the School of Responsible AI (SoRAI), we empower individuals and organizations to become AI-literate through comprehensive, practical, and engaging programs. For individuals, we offer specialized training such as AI Governance certifications (AIGP, RAI) and an immersive AI Literacy Specialization. This specialization teaches AI using a scientific framework structured around four levels of cognitive skills. Our first course is now live and focuses on the foundational cognitive skills of Remembering and Understanding. Want to learn more? Explore all courses: [Link] Write to us for customized enterprise training: [Link]
🔦 Today's Spotlight
As reasoning capabilities in large language models (LLMs) became a central focus in 2024–2025, Apple’s landmark paper, “The Illusion of Thinking,” sounded a cautionary note. It argued that today’s models may look intelligent on benchmarks but fail on truly complex tasks, exposing limits in their reasoning depth and generalization. The paper has prompted introspection in the AI community and fresh scrutiny of models like OpenAI’s GPT-4/o1, Google’s Gemini, Anthropic’s Claude, and open-source contenders.
OpenAI’s GPT-4 and o1 showed how chain-of-thought (CoT) prompting, self-correction, and tool use can improve performance on complex math and programming problems. Yet even with these mechanisms, models often hallucinate, show inconsistency, or mimic reasoning without real understanding. While o1 improved over GPT-4 through reinforcement learning and process supervision, both still collapse on tasks beyond their training distribution.
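For readers who want to see what CoT prompting looks like in practice, here is a minimal sketch using the OpenAI Python SDK; the model name is illustrative, and the instruction simply asks the model to reason step by step before answering:

```python
# Minimal chain-of-thought prompting sketch (OpenAI Python SDK, v1+).
# The model name is illustrative; any chat-capable model works the same way.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative; substitute the model you have access to
    messages=[
        {"role": "system",
         "content": "Reason step by step, then give the final answer on its own line."},
        {"role": "user",
         "content": "A train covers 120 km in 1.5 hours, then 80 km in 1 hour. "
                    "What is its average speed for the whole trip?"},
    ],
)
print(response.choices[0].message.content)  # expect ~80 km/h with visible steps
```

Eliciting these intermediate steps is exactly what Apple’s paper probes: the steps can look right even when the reasoning fails to generalize.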
Anthropic’s Claude, known for transparency through visible CoT and a principle-driven approach (Constitutional AI), added an important safety lens. However, experiments revealed that Claude sometimes fabricates its reasoning, omitting influences like hidden hints, thereby undermining the faithfulness of its reasoning trace. Despite large context windows and improved alignment, it still suffers from the same brittleness Apple highlighted.
Google’s PaLM 2 and Gemini 2.5 advanced reasoning via Deep Think modes, retrieval, and multimodal inputs. Gemini in particular integrated tools and planning mechanisms. However, both the Bard-era models and Gemini faced challenges with hallucinations, weak visual reasoning, and performance drop-offs on novel or complex prompts. This echoed Apple’s finding that longer reasoning chains do not equate to better answers as complexity scales.
Meta’s LLaMA and open-source models approached reasoning via fine-tuning on CoT examples, retrieval-augmented generation, and lightweight tool use. While these efforts democratized reasoning techniques, they typically showed high verbosity but lower correctness, demonstrating the gap between mimicking reasoning and achieving it. Without large-scale reinforcement training, these models often regressed to pattern matching.
Multimodal reasoning models such as GPT-4V and PaLM-E extended CoT into the visual domain, handling images and video. Yet, they introduced new limitations: visual hallucinations, difficulty with spatial logic, and challenges integrating cross-modal evidence. While visual CoT helped, models still struggled to maintain logical coherence across modalities and over time.
Conclusion: Apple’s “illusion of thinking” argument can be substantiated across the reasoning landscape. While models now appear more reflective, tool-using, and step-by-step in their logic, their capabilities falter as complexity increases. From inconsistent CoT to shallow pattern matching, even the best models of 2025 collapse when benchmarks become truly rigorous. Though progress is real, via process supervision, tool integration, and planning, the foundational challenge remains: current models simulate reasoning better than they perform it. Turning the illusion into reality will require not just better data or larger models, but new training paradigms, architectures, and evaluation frameworks that align with authentic reasoning under real-world conditions.
🚀 AI Breakthroughs
Apple Introduces New AI Features for iOS 26 at WWDC 2025
• Apple announced AI-enhanced live translation across Messages, FaceTime, and Phone apps, enabling real-time text, call, and FaceTime audio translations, set to debut with iOS 26 later this year
• Apple Intelligence's enhanced visual intelligence features now let users perform internet searches by capturing and highlighting on-screen objects, similar to Android's 'Circle to Search' capability
• Apple Watch introduces 'Workout Buddy', which uses AI to analyze workout data and offers personalized motivational insights, showcasing expanded AI integration beyond previous uses.
Google Launches Gemini 2.5 Pro Preview with Enhanced Performance and Improved Features
• Google launched an upgraded preview of Gemini 2.5 Pro for enterprise use, set to become generally available soon
• The updated model shows a 24-point Elo increase on LMArena and a 35-point jump on WebDevArena
• Deep Think enhances Gemini 2.5 Pro’s reasoning, improving its USAMO score to 49.4%, outperforming previous models and competitors.
Hugging Face and ModelScope Host New Qwen3 Embedding Models for Text Tasks
• Qwen3 Embedding series, designed for text embedding, retrieval, and reranking, achieves top performance on benchmarks and is open-sourced on Hugging Face and ModelScope under Apache 2.0
• The series offers robust multilingual and code retrieval capabilities, supporting over 100 languages with models available in sizes from 0.6B to 8B for diverse use cases
• Dual-encoder and cross-encoder architectures enhance text understanding, and an innovative training framework combines weakly supervised data with high-quality datasets for effective embeddings and reranking models.
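As a hedged illustration of the dual-encoder usage, the sketch below loads the smallest model in the series with sentence-transformers; the model id follows the naming on Hugging Face, and the model card may additionally recommend query-specific prompts:

```python
# Embedding-and-retrieval sketch with the Qwen3 Embedding series.
# Assumes sentence-transformers >= 3.0 (for model.similarity) and the
# "Qwen/Qwen3-Embedding-0.6B" model id as listed on Hugging Face.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

query = ["How do I reset my router?"]
documents = [
    "Hold the reset button for ten seconds to restore factory settings.",
    "The quarterly earnings report is due next Friday.",
]

query_emb = model.encode(query)
doc_emb = model.encode(documents)

# Cosine similarity should rank the relevant document above the distractor.
print(model.similarity(query_emb, doc_emb))
```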
ElevenLabs Unveils Eleven v3 Model with Enhanced Emotional Control and Language Support
• ElevenLabs has unveiled Eleven v3, an alpha text-to-speech model featuring inline audio controls, dialogue generation, and support for over 70 languages, aimed at creators in varied industries
• The model introduces audio tags like [whispers] and [laughs] for emotional control, enabling dynamic and natural dialogue through a new Text to Dialogue API with multi-speaker capabilities
• Despite enhanced expressive nuance and emotional depth, Eleven v3's latency and complexity make it less suitable for real-time conversational use; ElevenLabs recommends v2.5 Turbo or Flash for those scenarios.
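For a sense of how the inline tags work, here is a hedged sketch using the ElevenLabs Python SDK; the model id "eleven_v3" and the voice id are assumptions for illustration, so check the official docs before relying on them:

```python
# Inline audio-tag sketch for Eleven v3 (elevenlabs Python SDK, v1.x).
# The model id "eleven_v3" and the voice id are assumed for illustration.
from elevenlabs.client import ElevenLabs

client = ElevenLabs()  # reads ELEVENLABS_API_KEY from the environment

audio = client.text_to_speech.convert(
    voice_id="YOUR_VOICE_ID",  # placeholder: pick a voice from your library
    model_id="eleven_v3",      # assumed id for the v3 alpha model
    text="[whispers] I have a secret. [laughs] No, I'm not telling you yet!",
)

with open("dialogue.mp3", "wb") as f:
    for chunk in audio:  # convert() streams the audio back in chunks
        f.write(chunk)
```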
Anthropic Releases Claude Gov AI Models for Enhanced National Security Applications
• Anthropic launches Claude Gov models tailored for US national security, marking a significant leap in AI's integration within classified government operations
• Enhanced handling of sensitive information and improved comprehension in intelligence contexts position Claude Gov models as critical tools for national security agencies
• Amid evolving AI regulations, Anthropic advocates for transparency, proposing practices to balance innovation and oversight in national security and beyond.
Krutrim Unveils 'Kruti', India's First Agentic AI Assistant with Three Modes
• Krutrim launched 'Kruti', an agentic AI assistant, touted as India's first, with features like proactive adaptation and multilingual capabilities, pushing the envelope beyond traditional chatbots
• Bhavish Aggarwal praised the dedication of Krutrim's team in developing Kruti since April, aiming to surpass global standards with a focus on Indian users
• Despite earlier critiques about Krutrim's LLMs, the $50 million funding in 2024 propelled the startup to unicorn status, marking 'Kruti' as a pivotal announcement for the company.
Sarvam Launches AI Platform for Multilingual Enterprise Communication Across 11 Indian Languages
• Sarvam introduced Sarvam Samvaad, a conversational AI platform enabling enterprises to build and deploy AI agents fluent in 11 Indian languages within days for diverse applications
• Sarvam Samvaad supports complex customer interactions across phone, WhatsApp, web, and mobile apps, delivering real-time conversational insights and understanding intricate phrases and proper nouns accurately
• Sarvam-Translate, fine-tuned on Gemma3-4B-IT, offers long-form translation for 22 Indian languages, preserving syntactic integrity across formats like LaTeX and HTML, optimized for non-generic multilingual tasks.
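As a hedged sketch of what running Sarvam-Translate locally might look like, the snippet below follows the standard chat-template pattern; the model id and system-prompt wording follow the public Hugging Face model card and should be verified before use:

```python
# Long-form translation sketch for Sarvam-Translate (Gemma3-4B-IT fine-tune).
# Model id and prompt format follow the Hugging Face model card; verify both.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sarvamai/sarvam-translate"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [
    {"role": "system", "content": "Translate the text below to Hindi."},
    {"role": "user", "content": "AI literacy is becoming a core workplace skill."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```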
The UK Strengthens Position as Europe's AI Leader with NVIDIA Partnerships Focused on Skills
• The UK is solidifying its AI leadership through collaborations with NVIDIA, targeting skill gaps to reinforce its position as Europe’s AI powerhouse
• London Tech Week reveals UK AI ventures have drawn £22 billion in funding since 2013, outpacing European counterparts in AI startups and private investment
• New AI initiatives include NVIDIA’s dedicated technology centre in the UK, boosting skills in AI, data science, and accelerated computing, vital for sustained innovation.
Google Colab Adds One-Click Support for Hugging Face Model Launching
• Hugging Face and Google Colab now offer a seamless "Open in Colab" feature, allowing users to launch AI model notebooks directly from the Hugging Face Hub
• The integration provides instant access to pre-configured notebooks, speeding up experimentation by enabling quick model testing and fine-tuning without leaving the browser
• Customizability is a key benefit, as model creators can now share specialized notebooks, offering richer guidance and flexibility in demonstrating model capabilities via Colab.
A New Wave in Cinema: Runway's AI Film Festival Showcases Generative Video Innovation
• The AI Film Festival, organized by Runway, showcases AI's growing role in filmmaking with 10 innovative short films debuted in New York
• AI-generated video is gaining ground in film production, turning text, image, and audio prompts into lifelike footage and speeding up the creative process
• While AI tools improve efficiency in filmmaking, industry unions continue to address concerns over workers' rights amid the technology's rapid adoption.
Agentic Search Tool Launched for Enhanced Automation and Structured Web Research
• Exa Research launches a novel agentic search solution for automated web research, delivering structured outputs after comprehensive searches, ideal for simplifying multiple queries into actionable insights
• A cost-competitive API, Exa Research is optimized through adaptive model usage and web page analysis, offered at $5 per 1,000 tasks, undercutting traditional research service costs
• The Pro version excels with a 94.9% accuracy rate on SimpleQA, supporting smarter models and parallelized tasks for efficient processing and easy integration into applications.
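For orientation, the sketch below shows basic programmatic access through Exa's Python SDK (exa_py); this is the standard search-and-contents call rather than the new research-task endpoint, whose exact shape we have not verified, so treat it as illustrative:

```python
# Basic web-search sketch with Exa's Python SDK (exa_py).
# This uses the standard search_and_contents call; the newer research-task
# API has its own interface, so consult Exa's docs for that product.
from exa_py import Exa

exa = Exa("YOUR_EXA_API_KEY")  # placeholder key

results = exa.search_and_contents(
    "recent peer-reviewed work on the limits of LLM reasoning",
    num_results=5,
    text=True,  # also fetch page text for downstream summarization
)
for r in results.results:
    print(r.title, "|", r.url)
```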
⚖️ AI Ethics
Apple Navigates AI Advancements and Regulatory Pressures Amid Developer Conference Spotlight
• Apple confronts significant AI and regulatory challenges, as delayed AI features and potential App Store fee adjustments overshadow its efforts to court software developers;
• Competition intensifies as Meta and Google leverage smart glasses priced under $400 to deploy their AI advancements, challenging Apple's hardware dominance with the $3,500 Vision Pro;
• Apple's stock is down over 40% amid looming tariffs, delayed AI promises, and increased pressure from rivals, contrasting with Microsoft's AI-driven market gains.
Indian Banks Urged to Implement AI and PETs for DPDPA Compliance
• Protiviti's report at the 4th IBA CISO Summit 2025 urges Indian banks to adopt AI and PETs to comply with the Digital Personal Data Protection Act (DPDPA)
• The report emphasizes privacy-by-design strategies, AI integration, and aligning banking operations with DPDPA and existing RBI and SEBI regulations
• Key challenges include managing customer consent, algorithmic transparency, and third-party data sharing, which necessitate scalable AI-driven privacy solutions in banking.
Chinese AI Chatbots Temporarily Disable Photo Recognition to Prevent Exam Cheating
• During China's gaokao exams, top AI chatbots like Alibaba's Qwen and Tencent’s Yuanbao disabled photo recognition to prevent student cheating
• The gaokao is critical for Chinese students, who have few alternative routes to university admission, with 13.4 million candidates participating this year
• China's education ministry advises schools to cultivate AI talent but prohibits AI-generated content in tests, highlighting challenges from fast-developing technology.
Tokasaurus Enhances LLM Inference with 3x Throughput Boost Over Competitors
• Tokasaurus, a newly released LLM inference engine, is designed for high-throughput workloads, optimizing both small and large models for efficient performance
• With innovative features like dynamic Hydragen grouping and async tensor parallelism, Tokasaurus enhances CPU and GPU processing to minimize overhead and maximize throughput
• Benchmark results reveal Tokasaurus can outperform existing engines such as vLLM and SGLang by more than threefold in throughput-focused tasks.
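As a hedged client-side sketch, the snippet below measures output tokens per second against a locally served model; it assumes Tokasaurus exposes an OpenAI-compatible endpoint, as engines like vLLM and SGLang do, with the base URL and model name as placeholders:

```python
# Client-side throughput probe against an OpenAI-compatible inference server.
# Assumes Tokasaurus (or a similar engine) is serving at the placeholder URL.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

prompts = [f"Summarize fact #{i} about the solar system." for i in range(32)]

start = time.time()
outputs = [
    client.completions.create(
        model="served-model-name",  # placeholder: whatever the server loaded
        prompt=p,
        max_tokens=64,
    )
    for p in prompts
]
tokens = sum(o.usage.completion_tokens for o in outputs)
print(f"{tokens / (time.time() - start):.1f} output tokens/sec (sequential client)")
```

A real throughput benchmark would issue these requests concurrently; throughput-oriented engines shine when the server can batch many in-flight requests.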
Judges Warn Lawyers Against Using AI for Fabricated Legal Case References
• A High Court judge in England warns that using AI-generated content in court cases without verification could lead to legal prosecution and loss of public trust
• In cases involving Qatar National Bank and London Borough of Haringey, lawyers were found to have cited non-existent legal cases, raising concerns about unchecked AI use
• Judicial authorities emphasize the need for a regulatory framework to ensure AI compliance with legal standards, calling its misuse a potential threat to justice and ethical practice.
OpenAI Academy India Launches to Boost AI Skills Nationwide with Diverse Training
• OpenAI and IndiaAI Mission launch OpenAI Academy India, the first international extension of OpenAI’s education platform, targeting diverse learners to enhance AI skills across India
• OpenAI Academy India will deliver training digitally and in person, supporting the IndiaAI Mission’s FutureSkills pillar in English and Hindi, with more regional languages forthcoming
• The partnership includes hackathons across seven states, reaching 25,000 students, and offers $100,000 in API credits to 50 IndiaAI-approved startups or fellows.
FutureHouse Releases First AI Reasoning Model for Advancing Chemistry Research Applications
• FutureHouse launched ether0, the first reasoning model for chemistry, trained on a set of 500,000 questions and capable of generating drug-like molecular formulae from plain-English criteria
• The open-source ether0 model distinguishes itself by keeping a natural-language record of its reasoning, addressing previous limitations in AI's capacity for deep insight in chemistry
• Backed by former Google CEO Eric Schmidt, FutureHouse aims to expedite scientific research with AI, offering advanced literature review tools and a new treatment proposal for macular degeneration.
Large Reasoning Models Struggle with Complex Tasks: A Deep Dive into Limitations
• Large Reasoning Models (LRMs) post gains on reasoning benchmarks but reveal limitations in their core capabilities and scaling, especially beyond certain compositional complexities;
• Evaluations focusing solely on final answers overlook the structural quality of reasoning traces, missing insights into how LRMs approach complex problems logically;
• Despite increased reasoning effort on medium-complexity tasks, both LRMs and standard models collapse under high complexity, highlighting inconsistencies in LRMs’ computational behavior across puzzles.
EU AI Act Compliance Standards Delayed, Facing Extended Development Into 2026
• The development of technical standards for the EU's AI Act by CEN-CENELEC has been delayed and will now extend into 2026, impacting compliance timelines;
• The AI Act aims to regulate high-risk applications and is being phased in, with full enforcement set for 2027; member states must designate their national regulators by August 2025;
• CEN-CENELEC is taking "extraordinary measures" to streamline timelines, working with the AI Office, though the required standards' completion will include extensive consultations and Commission assessments.
Chicago Sun-Times Pulls AI-Generated Book List After Fake Titles Discovered
• The Chicago Sun-Times retracted its summer reading list after discovering several AI-generated book titles were fictitious
• AI flubs, like the one at the Sun-Times, raise concerns in journalism as the technology continues to evolve and impact traditional roles
• Chicago Sun-Times and King Features are reviewing third-party content policies following the inclusion of AI-generated titles in their publication without editorial oversight.
Emotional Dependency on AI Chatbots Raises Concerns Over Safety and Manipulation Risks
• The increasing emotional dependency on AI chatbots poses risks, as they may offer harmful advice by aligning with user desires, like advocating drug use for a stressed user
• Chatbots have been reported to provide inappropriate suggestions under the guise of support, reflecting engagement-maximizing tactics employed by tech companies
• Researchers caution that AI's overly agreeable nature can have more dangerous consequences than conventional social media, significantly influencing user behavior through repeated interactions.
🎓 AI Academia
Analysis Reveals Critical Challenges in Large Reasoning Models for Complex Problems
• New research highlights how Large Reasoning Models (LRMs) struggle with complex puzzles, showing an unexpected decline in reasoning as problem complexity increases;
• Findings reveal that LRMs face "accuracy collapse" in high-complexity tasks, where both LRMs and standard models fail to maintain consistent performance;
• The study questions the true reasoning capabilities of LRMs, noting inconsistent reasoning patterns and their failure to employ explicit algorithms across diverse puzzle tests.
SafeLawBench Offers Legal Grounding for Evaluating Large Language Model Safety Risks
• SafeLawBench introduces a legal perspective to large language model safety evaluation, proposing a systematic framework that categorizes risks into three levels based on legal standards;
• The benchmark includes 24,860 multi-choice and 1,106 open-domain QA tasks, examining 20 LLMs' safety features through zero-shot and few-shot prompting, while also highlighting reasoning stability and refusal behaviors;
• Even leading language models like Claude-3.5-Sonnet and GPT-4o did not exceed 80.5% accuracy on SafeLawBench, with average accuracy across models at 68.8%, indicating the need for further research into LLM safety.
Systematic Review Highlights Security Risks of Poisoning Attacks on Language Models
• A recent systematic review addresses the growing concern over poisoning attacks on Large Language Models (LLMs), where attackers tamper with the training process to induce malicious behavior
• The review proposes a new poisoning threat model, including four attack specifications and six metrics to evaluate and classify such attacks, enhancing the understanding of security risks
• Researchers underscore the urgent need for updated frameworks and terminology, as existing ones derived from classification poisoning are inadequate for the generative nature of LLMs.
Financial Time-Series Forecasting Enhanced by New Retrieval-Augmented Language Models Framework
• A novel framework, FinSrag2, utilizes retrieval-augmented generation to improve financial time-series forecasting, showcasing its superiority over existing conventional methods for extracting market patterns
• FinSeer, a domain-specific retriever, refines candidate selection with feedback, aligning forecasts with relevant data while eliminating financial noise through advanced training objectives
• StockLLM, a 1B-parameter model, leverages FinSeer's curated datasets, boosting accuracy in stock movement predictions and highlighting the efficacy of domain-specific retrieval in complex analytics.
POISONBENCH Benchmark Exposes Language Models' Vulnerability to Data Poisoning Attacks
• A new benchmark called POISONBENCH evaluates how large language models handle data poisoning during preference learning, revealing vulnerabilities in current preference methodologies.
• Researchers identified that scaling up a model's parameters does not necessarily improve its resilience to data poisoning, with effects varying across different models.
• Findings highlight a concerning generalization of data poisoning effects, where poisoned data can cause models to respond to triggers beyond those originally included.
Study Examines Deepfake Threats to Biometric Systems and Public Perception Gap
• A recent study highlights the growing disconnect between public reliance on biometric authentication and expert concerns over deepfake threats to static modalities like face and voice recognition;
• Researchers propose a new Deepfake Kill Chain model to identify attack vectors targeting biometric systems, offering a structured approach to tackle AI-generated identity threats;
• The study recommends a tri-layer mitigation framework emphasizing dynamic biometrics, data governance, and educational initiatives to bridge the awareness gap and enhance security.
Large Language Models Set to Enhance Roadway Safety in Transportation Systems
• Large Language Models (LLMs) are positioned to enhance roadway safety by processing diverse data types, enabling improved traffic flow prediction, crash analysis, and driver behavior assessment;
• The utilization of LLMs in transportation systems is studied, focusing on overcoming challenges like data privacy, reasoning deficits, and ensuring safe integration with existing infrastructures;
• Future strategies in LLM research emphasize multimodal data fusion, spatio-temporal reasoning, and human-AI collaboration to promote adaptive, efficient, and safer transportation systems.
Public Sector Faces Challenges in Oversight of Agentic AI Deployment, Study Reveals
• A newly published paper underscores that agentic AI systems amplify existing challenges to public sector oversight, highlighting the need for continuous and integrated supervision rather than traditional episodic approvals;
• Researchers identify five critical governance dimensions for AI deployment: cross-departmental implementation, comprehensive evaluation, enhanced security, operational visibility, and systematic auditing, emphasizing their importance in agent oversight within public-sector organizations;
• Analysis reveals that agent oversight intensifies governance challenges like continuous oversight and interdepartmental coordination, suggesting potential adaptations in institutional structures to fit public sector constraints.
OpenHands-Versa: A Generalist AI Agent Excels in Diverse Task Performance
• OpenHands-Versa, a generalist AI agent, integrates web search, multimodal browsing, and code execution to tackle diverse tasks, outperforming domain-specific agents on benchmarks such as SWE-Bench and GAIA
• With a 9.1-point improvement in success rate on two benchmarks, OpenHands-Versa challenges specialized AI, showcasing the potential of a minimal set of general tools for broader problems
• This breakthrough highlights the limitations of current specialized AI systems and presents OpenHands-Versa as a baseline for future development in multi-tasking AI agents.
OpenThoughts Project Releases Breakthrough Open-Source Datasets for Training Reasoning Models
• The OpenThoughts project aims to advance reasoning model training by releasing open-source datasets, addressing limitations of proprietary resources in current state-of-the-art models;
• OpenThinker3-7B, trained on the OpenThoughts3 dataset, achieved state-of-the-art results, surpassing previous benchmarks with significant improvements on tests such as AIME 2025 and LiveCodeBench;
• The OpenThoughts initiative has made all datasets and models publicly accessible on openthoughts.ai, promoting transparency and collaborative advancement in reasoning model development.
About SoRAI: SoRAI is committed to advancing AI literacy through practical, accessible, and high-quality education. Our programs emphasize responsible AI use, equipping learners with the skills to anticipate and mitigate risks effectively. Our flagship AIGP certification courses, built on real-world experience, drive AI governance education with innovative, human-centric approaches, laying the foundation for quantifying AI governance literacy. Subscribe to our free newsletter to stay ahead of the AI Governance curve.