Meta, Apple say the quiet part out loud: The genAI emperor has no clothes
Amid the mountains of vendor cheerleading for generative AI efforts, often amplified by enterprise board members, skeptical CIOs tend to feel outnumbered. But their worries may now have some company, in the form of a research report from Apple and an interview with Meta’s chief AI scientist, both of which raise serious questions about whether genAI can actually do much of what its backers claim.
The debate involves some fairly amorphous terms, at least when applied to computing — things like reasoning and logic. When a large language model (LLM), for example, proposes a different and ostensibly better way to do something, is it because its sophisticated algorithm has figured out a better approach? Is it guessing wildly and sometimes getting lucky? Or did it hallucinate and stumble into something helpful?
Would a CIO ever trust a human employee with such tendencies? Not likely, but IT leaders are regularly tasked with integrating genAI tools into the enterprise environment by corporate executives expecting miracles.
The conclusions drawn by AI experts from Apple and Meta may help CIOs set more realistic expectations about what genAI models can and cannot do, now and in the near future.
GenAI is not that intelligent
The Apple report, which was the more detailed research effort, is also the more damning of the two. Its authors stated:
“Our findings reveal that LLMs exhibit noticeable variance when responding to different instantiations of the same question. Specifically, the performance of all models declines when only the numerical values in the question are altered.
“Furthermore, we investigate the fragility of mathematical reasoning in these models and demonstrate that their performance significantly deteriorates as the number of clauses in a question increases… When we add a single clause that appears relevant to the question, we observe significant performance drops (up to 65%) across all state-of-the-art models, even though the added clause does not contribute to the reasoning chain needed to reach the final answer.”
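The failure mode the researchers describe is straightforward to probe. Below is a minimal sketch of that kind of perturbation test, loosely modeled on the paper’s kiwi example; query_llm is a hypothetical stand-in for whatever model API you actually use, and the prompt and scoring are illustrative, not the paper’s own harness:

```python
import random

def query_llm(prompt: str) -> str:
    # Hypothetical stand-in: swap in your actual model call here.
    raise NotImplementedError

# GSM8K-style template; {distractor} is an irrelevant clause that should
# not change the arithmetic at all.
TEMPLATE = (
    "Oliver picks {a} kiwis on Friday and {b} kiwis on Saturday. "
    "{distractor}How many kiwis does Oliver have?"
)
DISTRACTOR = "Five of the kiwis were a bit smaller than average. "

def run_trial(with_distractor: bool) -> bool:
    # Fresh numbers each trial, so we also test sensitivity to value changes.
    a, b = random.randint(10, 99), random.randint(10, 99)
    prompt = TEMPLATE.format(
        a=a, b=b, distractor=DISTRACTOR if with_distractor else ""
    )
    return str(a + b) in query_llm(prompt)  # crude correctness check

# Score many trials each way: per the paper, merely swapping the numbers
# hurts accuracy, and the irrelevant clause hurts it far more.
```

A model that genuinely reasoned about the problem would ignore the smaller kiwis; the paper found that even state-of-the-art models are routinely thrown off by them.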
What does mathematical reasoning have to do with AI-powered business applications? The Apple research team spelled it out:
“Mathematical reasoning is a crucial cognitive skill that supports problem-solving in numerous scientific and practical applications. Consequently, the ability of large language models (LLMs) to effectively perform mathematical reasoning tasks is key to advancing artificial intelligence and its real-world applications.”
What today’s state-of-the-art LLMs do is not logical reasoning, the researchers concluded:
“Current LLMs are not capable of genuine logical reasoning; instead, they attempt to replicate the reasoning steps observed in their training data… It may resemble sophisticated pattern matching more than true logical reasoning.”
Meta’s analysis comes by way of an interview with The Wall Street Journal featuring AI legend Yann LeCun, Meta’s chief AI scientist. In the story, LeCun called the notion that AI will soon become advanced enough to pose a threat to humanity “complete B.S.” Like the Apple researchers, he said AI is a powerful tool but not truly intelligent, according to interviewer Christopher Mims:
“When a departing OpenAI researcher in May talked up the need to learn how to control ultra-intelligent AI, LeCun pounced. ‘It seems to me that before “urgently figuring out how to control AI systems much smarter than us,” we need to have the beginning of a hint of a design for a system smarter than a house cat,’ he replied on X.
“He likes the cat metaphor. Felines, after all, have a mental model of the physical world, persistent memory, some reasoning ability and a capacity for planning, he says. None of these qualities are present in today’s ‘frontier’ AIs, including those made by Meta itself.”
Later, the WSJ story lets LeCun make his central point:
“Today’s models are really just predicting the next word in a text, he says. But they’re so good at this that they fool us. And because of their enormous memory capacity, they can seem to be reasoning, when in fact they’re merely regurgitating information they’ve already been trained on.
“‘We are used to the idea that people or entities that can express themselves, or manipulate language, are smart — but that’s not true,’ says LeCun. ‘You can manipulate language and not be smart, and that’s basically what LLMs are demonstrating.’”
That is the key issue. Enterprises are putting far too much faith in genAI systems, says Francesco Perticarari, general partner at technology investment house Silicon Roundabout Ventures in London, England.
It’s easy to assume that the rare correct answers these tools deliver are flashes of brilliance rather than lucky guesses. But “the output is not based at all on reasoning. It is merely based on extremely powerful computing,” Perticarari said.
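LeCun’s description of next-word prediction is literal. A production model conditions on billions of learned weights, but the generation loop itself has the shape of the toy below, a hedged illustration only, with a hypothetical bigram table standing in for the learned distribution:

```python
import random

# Hypothetical bigram table standing in for billions of learned weights.
NEXT_WORD = {
    "the":      [("contract", 0.5), ("court", 0.3), ("market", 0.2)],
    "contract": [("is", 0.7), ("was", 0.3)],
    "court":    [("ruled", 0.6), ("found", 0.4)],
    "market":   [("is", 1.0)],
    "is":       [("binding", 0.5), ("void", 0.5)],
}

def generate(seed: str, max_words: int = 6) -> str:
    # Repeatedly sample "the next word" given only the previous one.
    words = [seed]
    for _ in range(max_words):
        options = NEXT_WORD.get(words[-1])
        if not options:
            break
        tokens, weights = zip(*options)
        words.append(random.choices(tokens, weights=weights)[0])
    return " ".join(words)

print(generate("the"))  # e.g. "the contract is binding": fluent, not reasoned
```

The output can look confident and grammatical while nothing in the loop models the world; scale makes the trick convincing, not different in kind.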
Putting genAI in the driver’s seat
One frequently cited selling point for genAI is that some models have proven quite effective at passing various state bar exams. But bar exams are an ideal environment for genAI, because the answers are all published. Memorization and regurgitation are ideal uses for genAI, but that doesn’t mean genAI tools have the skills, understanding, and intuition to practice law.
“The logic is that if genAI can pass the bar exam, it can handle my business, build systems that are robust and that work now,” said Alan Nichol, co-founder and CTO of AI vendor Rasa. “[Business leaders] are taking this dangerous, naive approach and just letting the LLM figure it out,” he said.
Nichol pointed to Apple’s analysis showing that the more complex and multilayered the math problems got, the more the LLMs got lost and confused.
“It’s supposed to understand this math, but something is definitely fishy. The medium through which they are doing [these calculations] is natural language. It’s fuzzy and imprecise,” he said. “Language models were never supposed to do a lot of these things. There are vanishingly few situations where you want your software to guess what it should be doing, what the next few steps should be.”
Nichol stressed that these systems, left to their own devices, are reckless. “Four out of five times, genAI doesn’t follow its own instructions,” he said. “You want it to guess business logic? It just doesn’t work and is extremely slow and consumes a tremendous amount of tokens.”
Perticarari from Silicon Roundabout Ventures is especially concerned about hallucinations coupled with the lack of meaningful guardrails. GenAI seems to easily overcome — or be tricked by a user into overcoming — many of the safeguards organizations attempt to place around it.
“If you have a one-year-old, you wouldn’t give her a loaded gun and then try and explain to her why she shouldn’t shoot you,” Perticarari said. “[GenAI is] not sentient. Humans are sentient and they assume the system is intelligent, too. Letting genAI run on autopilot to me is crazy. Don’t give anything to a black box.”
Fighting FOMO
Perticarari blames enterprise executives and board members for falling victim to countless AI sales pitches. He says that CIOs have to be the voice of sanity.
“It is always easy during a gold rush to sell hype. [Sales execs] just keep delivering endless layers of selling without really understanding,” Perticarari said. “CIOs need to ask, ‘How fundamental and vital is the task that [we] are outsourcing to genAI?’”
Jake Reynolds, the CTO at cybersecurity vendor Wirespeed, agrees. He maintains that a lot of the rush to genAI has been pushed by board members, and “the CIO had to tag along.”
Executives are giving in to FOMO (fear of missing out), thinking that “their largest competitor is doing it, so we are going to do it,” he said. “But it doesn’t deliver. Even with the more objective mathematics, it starts falling apart. Try to get consistency out of it. You can’t. The words it predicts changes every time you tweak a little knob… Are you really OK with your product only working 80% of the time?”
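The “little knob” Reynolds mentions is, in most deployments, the sampling temperature. A minimal sketch of why consistency is so hard to get (the words and scores here are invented for illustration):

```python
import math
import random

def sample_word(logits: dict, temperature: float) -> str:
    # Softmax with temperature over a toy next-word score table.
    weights = {w: math.exp(score / temperature) for w, score in logits.items()}
    words = list(weights)
    return random.choices(words, weights=[weights[w] for w in words])[0]

logits = {"approve": 2.0, "deny": 1.0, "escalate": 0.5}
for t in (0.2, 1.0, 2.0):
    picks = [sample_word(logits, t) for _ in range(1000)]
    print(t, {w: picks.count(w) for w in logits})
# At t=0.2 the model almost always says "approve"; at t=2.0 all three
# words appear in similar proportions. Same model, different answers.
```

Even at a fixed temperature the draw is random, so identical prompts can yield different outputs run to run, which is exactly the consistency problem Reynolds describes.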
Reynolds encourages CIOs to slow down and be as minimalistic as practical. “We’re not laggards. We’re just realists about what the technology can really do,” he said.
Judicious use of genAI tools can head off disappointment, or worse, agrees Nichol. “We should just let the LLMs do what the LLMs are amazing at. Don’t let the LLM do everything.”
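One way to put that advice into practice is sketched below, under assumptions of our own rather than anything Nichol specified: the hypothetical classify_intent helper uses an LLM only for the language step, mapping free text to a known label, while the business rules stay in ordinary, testable code.

```python
REFUND_LIMIT = 100.0  # deterministic policy lives in code, not in a prompt

def classify_intent(message: str) -> str:
    # Hypothetical LLM call that maps free text to one of a few known labels.
    raise NotImplementedError

def handle(message: str, order_total: float) -> str:
    intent = classify_intent(message)  # LLM: language in, label out
    if intent == "refund_request":     # code: rules the LLM never guesses at
        if order_total <= REFUND_LIMIT:
            return "refund_approved"
        return "escalate_to_human"
    return "route_to_support"
```

If the model misreads a message, the failure is contained to a wrong label; it never gets to improvise the refund policy itself.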