Microsoft claims new ‘Correction’ tool can fix genAI hallucinations
Generative AI (genAI) and “hallucinations” go hand in hand, regardless of how well the large language models behind genAI tools are trained.
So, Microsoft on Tuesday unveiled Correction, a new capability within its existing Azure AI Content Safety tool that it said can ferret out, then correct, genAI responses that aren’t grounded in the data sources connected to an LLM; in other words, hallucinations.
“Empowering our customers to both understand and take action on ungrounded content and hallucinations is crucial, especially as the demand for reliability and accuracy in AI-generated content continues to rise,” Microsoft said in a blog post.
While “add-on tools” can help double-check the accuracy of the LLM outputs, Gartner has found that using better search techniques for Retrieval Augmented Generation (RAG) or grounding should be a first step to mitigate hallucinations. “We advise clients to use search to provide information to ground the LLM response in an enterprise context,” said Gartner Distinguished Vice President Analyst Jason Wong.
RAG is a method of tailoring a genAI model’s output by retrieving relevant information from an external source, such as an enterprise’s own documents, and supplying it to the LLM alongside the user’s query, which enables more accurate and specific responses.
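As a rough illustration of grounding a response with retrieved information, the Python sketch below picks the most relevant passage from a small document set and places it ahead of the question in the prompt. Everything in it is an assumption made for illustration: the sample documents, the word-overlap retriever, and the function names are hypothetical, and a production system would use a real search index and send the prompt to an LLM.

```python
import re

# Hypothetical in-memory "enterprise" documents; a real system would query a search index.
DOCUMENTS = [
    "Employees accrue 20 vacation days per year.",
    "Expense reports are due by the fifth business day of each month.",
    "The VPN client must be updated every quarter.",
]

def tokenize(text):
    """Lowercase the text and return its set of words and numbers."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, documents, top_k=1):
    """Rank documents by how many words they share with the query (a toy retriever)."""
    query_words = tokenize(query)
    ranked = sorted(documents, key=lambda d: len(query_words & tokenize(d)), reverse=True)
    return ranked[:top_k]

def build_prompt(query, documents):
    """Place the retrieved passages ahead of the question so the model can ground its answer."""
    context = "\n".join(f"- {p}" for p in retrieve(query, documents))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

prompt = build_prompt("How many vacation days do employees get?", DOCUMENTS)
print(prompt)
```

The prompt that reaches the model carries the vacation policy passage with it, so the answer can be checked against a known source rather than against whatever the model absorbed in training.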
Along with Google, a number of startups and other cloud service providers have been offering tools to monitor, evaluate and correct problems with genAI results in the hopes of eliminating systemic problems.
Microsoft’s Correction tool was among several AI feature updates that included Evaluations in Azure AI Studio, a risk assessment tool, and Hybrid Azure AI Content Safety (AACS), an embedded SDK for on-device AI processing.
Correction is available as part of Microsoft’s Azure AI Content Safety API, which is currently in preview; it can be used with any text-based genAI model, including Meta’s Llama and OpenAI’s GPT-4o.
Analysts, however, are skeptical about how effective Correction will be at eliminating errors. “Hallucinations continue to dog generative AI implementations,” said Wong. “All the hyperscalers have launched products to mitigate hallucinations, but none promise eliminating [them] altogether or even reaching certain thresholds of accuracy.”
Microsoft first introduced its “groundedness” detection feature in March. To use it, a genAI application must connect to grounding documents, which are used in document summarization and RAG-based Q&A scenarios, Microsoft said. Since then, it said, customers have been asking what they can do once erroneous information is detected, besides blocking it.
“This highlights a significant challenge in the rapidly evolving generative AI landscape, where traditional content filters often fall short in addressing the unique risks posed by generative AI hallucinations,” Microsoft Senior Product Marketing Manager Katelyn Rothney wrote in a blog post.
Building on the company’s existing groundedness detection, the Correction tool allows Azure AI Content Safety to both identify and correct hallucinations in real time, before users of genAI applications encounter them. It works by first flagging the ungrounded content. Then the Azure AI Content Safety system initiates a rewriting process in real time to revise the inaccurate portions and ensure alignment with connected data sources.
“This correction happens before the user is able to see the initial ungrounded content,” Rothney said. “Finally, the corrected content is returned to the user.”
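The flag-then-revise flow Microsoft describes can be sketched in a few lines of Python. This is not Microsoft’s implementation or the Azure AI Content Safety API: the function names are hypothetical, the groundedness check is a crude word-matching heuristic, and the “correction” step simply drops flagged sentences where a real system would rewrite them with a model.

```python
import re

def content_words(text):
    """Lowercased words and numbers, minus a few common stop words."""
    stop = {"the", "a", "an", "is", "are", "was", "in", "on", "of", "and", "to", "it"}
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in stop}

def split_sentences(text):
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def flag_ungrounded(response, sources):
    """Step 1: return sentences containing words not found in any grounding source."""
    source_words = set()
    for source in sources:
        source_words |= content_words(source)
    return [s for s in split_sentences(response) if not content_words(s) <= source_words]

def correct(response, sources):
    """Step 2: revise the response before the user sees it.

    Assumption: a real system would rewrite flagged text with a model;
    this toy version only removes the flagged sentences.
    """
    flagged = set(flag_ungrounded(response, sources))
    return " ".join(s for s in split_sentences(response) if s not in flagged)

sources = ["The warranty covers parts and labor for 12 months."]
draft = "The warranty covers parts and labor. It lasts 36 months."
print(flag_ungrounded(draft, sources))  # ['It lasts 36 months.']
print(correct(draft, sources))          # The warranty covers parts and labor.
```

Only the output of `correct` would be shown to the user, which mirrors the ordering in Rothney’s description: detection and revision both happen before the response is returned.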
GenAI technologies such as OpenAI’s GPT-4 (the basis for Microsoft’s AI), Meta’s Llama 2 and Google’s PaLM 2 are prone to hallucinations because their foundation models are based on massive, amorphous, nonspecific parameters, or options, from which the algorithm can choose answers.
While genAI is most often highly accurate in providing answers to queries, it is also prone to gathering information from places it was never meant to go, just so it can provide a response, any response.
In fact, LLMs have been characterized as stochastic parrots: as they get larger, their conjectural answers become more random. Essentially, the “next-word prediction engines” just continue to parrot what they’ve been taught, but without a logic framework.
One study from Stanford University this year found genAI makes mistakes when answering legal questions 75% of the time. “For instance,” the study found, “in a task measuring the precedential relationship between two different [court] cases, most LLMs do no better than random guessing.”
Optimizing the search infrastructure by incorporating both lexical and semantic search increases the likelihood that only relevant information is passed to the LLM, Wong said.
“While this can significantly reduce the likelihood of hallucinations, it still cannot eliminate them,” he said. “The quality of the information retrieved for RAG largely determines the output quality, making content management and governance essential as a starting point for minimizing hallucinations.”
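One common way to combine lexical and semantic search results is reciprocal rank fusion, which merges two ranked lists by rewarding documents that place well in both. Wong did not name a specific technique, so the Python sketch below is an assumption about how such a hybrid might look; the document IDs and the two input rankings are made up, and in practice they would come from a keyword engine and an embedding-based search.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of document IDs into one list, best first.

    Each document earns 1 / (k + rank) from every list it appears in;
    k dampens the advantage of the very top positions.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results for the same query from two different retrievers.
lexical_ranking = ["doc_a", "doc_b", "doc_c"]    # keyword match order
semantic_ranking = ["doc_b", "doc_c", "doc_a"]   # embedding similarity order

fused = reciprocal_rank_fusion([lexical_ranking, semantic_ranking])
print(fused)  # ['doc_b', 'doc_a', 'doc_c']
```

Here doc_b rises to the top because it ranks near the top of both lists, while doc_a, which only the keyword search favored, falls to second.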