AI isn’t really that smart yet, Apple researchers warn
While we wait for the Age of Apple Intelligence, it may be worth considering a recent Apple research study that exposes critical weaknesses in existing artificial intelligence models.
Apple’s researchers wanted to figure out the extent to which large language models (LLMs) such as GPT-4o, Llama, Phi, Gemma, or Mistral can actually engage in genuine logical reasoning to reach their conclusions and make their recommendations.
The study shows that, despite the hype, LLMs don’t really perform logical reasoning; they simply reproduce the reasoning steps they learn from their training data. That’s quite an important admission.
This is what Apple’s researchers found about AI
“Current LLMs are not capable of genuine logical reasoning; instead, they attempt to replicate the reasoning steps observed in their training data,” the Apple team said.
They found that while these models may seem to show logical reasoning, even slight changes in the wording of a query could lead to very different answers. “The fragility of mathematical reasoning in these models [shows] that their performance significantly deteriorates as the number of clauses in a question increases,” they warned.
In an attempt to overcome the limitations of existing tests, Apple’s research team introduced GSM-Symbolic, a benchmarking tool designed to assess how effectively AI systems reason.
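To get a feel for the approach: GSM-Symbolic turns fixed grade-school math questions into templates whose names and numbers can be resampled while the underlying logic stays constant. The short Python sketch below is a hypothetical illustration of that idea; the template wording, names, and value ranges are invented for this example, not taken from Apple’s benchmark.

```python
import random

# Hypothetical GSM-Symbolic-style template: the question's logic is fixed,
# while the protagonist's name and the numeric values are resampled per instance.
TEMPLATE = (
    "{name} picks {per_day} kiwis every day for {days} days. "
    "How many kiwis does {name} have in total?"
)

def make_instance(rng: random.Random) -> tuple[str, int]:
    """Generate one question variant and its ground-truth answer."""
    name = rng.choice(["Sophie", "Liam", "Oliver", "Mia"])
    per_day = rng.randint(2, 9)
    days = rng.randint(3, 14)
    question = TEMPLATE.format(name=name, per_day=per_day, days=days)
    answer = per_day * days  # the single reasoning step the model must perform
    return question, answer

rng = random.Random(0)
for _ in range(3):
    question, answer = make_instance(rng)
    print(question, "->", answer)
```

A model that genuinely reasons should score the same on every resampled variant; the study found that accuracy shifted noticeably across such variants.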
Not-so-smart smart bots
The research does show some strengths in the models available today. For example, GPT-4o still achieved a 94.9% accuracy rate in tests, though that rate dropped significantly when researchers made the problems more complex.
That’s good as far as it goes, but the success rate nearly collapsed, falling by as much as 65.7%, when researchers modified the challenge by adding “seemingly relevant but ultimately inconsequential statements.”
Those drops in accuracy reflect a limitation inherent in current LLMs, which still basically rely on pattern matching to produce results rather than on true logical reasoning. These models “convert statements to operations without truly understanding their meaning,” the researchers said.
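A sketch helps make that failure mode concrete. The distractor wording below is illustrative (it loosely echoes the widely quoted kiwi example discussed around the paper, but the exact text is my own): a clause that mentions a number and sounds relevant, yet changes nothing about the arithmetic.

```python
def add_noop_clause(question: str) -> str:
    """Insert a seemingly relevant but inconsequential statement
    before the final question sentence."""
    distractor = "Five of the kiwis picked were a bit smaller than average."
    body, _, final_question = question.rpartition(". ")
    return f"{body}. {distractor} {final_question}"

q = ("Oliver picks 4 kiwis every day for 5 days. "
     "How many kiwis does Oliver have in total?")
print(add_noop_clause(q))
# A pattern-matching model is tempted to subtract the five "smaller" kiwis
# and answer 15; a genuine reasoner sees the clause is irrelevant: 4 * 5 = 20.
```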
Commenting on Apple’s research, Gary Marcus, a scientist, author, AI critic, and professor of psychology and neural science at NYU, wrote: “There is just no way you can build reliable agents on this foundation, where changing a word or two in irrelevant ways or adding a few bits of irrelevant info can give you a different answer.”
Professor Marcus also pointed to some other tasty hints that Apple’s findings are correct, including an Arizona State University analysis showing that LLM performance declines as problems grow larger, and the inability of chatbots to play chess without making illegal moves.
What about human oversight?
All the same, the high accuracy these machines show on more conventionally framed problems suggests that, while fragile, AI will be of use as an adjunct to human decision-making.
At the very least, the data suggests it is unwise to place total trust in the technology; the models tend to fail when the logic they derive during training is stretched. It seems AI doesn’t know what it is doing, and it lacks the self-criticism needed to spot a mistake when one is made.
Of course, this lack of logical coherence may be great news for some AI evangelists who frequently deny that AI deployment will cost jobs.
Why?
Because it provides an argument that humans will still be required to oversee these intelligent machines. But the skilled human operators capable of spotting logical errors before they are acted on will probably need different skills than the humans AI moves aside.
Move fast, break all the things
Writing in an extensive social media post explaining the report, Apple researcher Mehrdad Farajtabar warned:
“Understanding LLMs’ true reasoning capabilities is crucial for deploying them in real-world scenarios where accuracy and consistency are non-negotiable — especially in safety, education, health care, and decision-making systems. Our findings emphasize the need for more robust and adaptable evaluation methods. Developing models that move beyond pattern recognition to true logical reasoning is the next big challenge for the AI community.”
I think there is another challenge as well. Apple’s research team perhaps inadvertently showed that existing models simply apply the kind of logic they have been trained to use.
The looming problem is the extent to which the logic used to train those models may reflect the limitations and prejudices of those who pay for their creation. As those models are deployed in the real world, the decisions they take will maintain whatever flaws (ethical, moral, logical, or otherwise) were inherent in the original logic.
Baking those weaknesses into AI systems used internationally on a day-to-day basis may end up strengthening prejudice while weakening the evidence for necessary change.
Garbage out
Even within recent draft AI regulations, these big arguments remain largely unresolved by starry-eyed governments chasing the chimera of economic growth in an age of existentially challenging, crisis-driven change.
If nothing else, Apple’s teams have shown the extent to which the current belief in AI as a panacea for all evils is becoming a new tech faith system (like that anti-Wi-Fi amulet currently being sold by one media personality), given how easily a few query tweaks can generate false results and illusions.
In the end, it really shouldn’t be controversial to say that we don’t want AI systems in charge of public transportation (including robotaxis) to cause accidents merely because their sensors picked up confusing data that the underlying model couldn’t figure out.
In a world of constant possibility, unexpected challenge is normal, and garbage in does, indeed, become garbage out. Perhaps we should be more deliberate in the application of these new tools? The public certainly seems to think so.
Please follow me on Mastodon, or join me in the AppleHolic’s bar & grill and Apple Discussions groups on MeWe.