OpenAI’s SimpleQA tool for discerning genAI accuracy — right message, wrong messenger
In the ongoing and potentially futile effort by CIOs to squeeze meaningful ROI out of their shiny, new generative AI (genAI) tools, there is no more powerful villain than hallucinations. They are what cause everyone to seriously wonder whether the analysis genAI delivers is valid and usable.
From that perspective, I applaud OpenAI for trying to create a test to determine objective accuracy for genAI tools. But that effort, called SimpleQA, fails enterprise tech decision-makers in two ways. First, OpenAI is the last business any CIO would trust to determine the accuracy of the algorithms it is selling. Would you trust an app from Walmart, Target or Amazon that determines the best place to shop, or perhaps a car evaluation tool from Toyota or GM?
The second problem is that SimpleQA focuses on, well, simple stuff. It looks at objective and simple questions that ostensibly have only one correct answer. More to the point, the answers to those questions are easily determined and verified.
That is just not how most enterprises want to use genAI technology. Eli Lilly and Pfizer want it to find new drug combinations to cure diseases. (Sorry, that should be “treat.” Treat makes companies money forever. Cure’s revenue is large, but ends far too quickly.) Yes, it would test those treatments afterwards, but that is a lot of wasted effort if genAI is wrong. Costco and Walgreens want to use it to find the most profitable places to build new stores. Boeing wants it to come up with more efficient ways to build aircraft.
Let’s delve into what OpenAI created. For starters, here’s OpenAI’s document. I’ll put the company’s comments into a better context.
“An open problem in artificial intelligence is how to train models that produce responses that are factually correct.” Translation: We figured it would be nice to have it give a correct answer every now and then.
“Language models that generate more accurate responses with fewer hallucinations are more trustworthy and can be used in a broader range of applications.” Translation: Call us hippies, if you must, but we brainstormed and concluded that our revenue could be improved if our product actually worked.
Those flippant comments aside, I want to acknowledge that OpenAI makes a good faith effort here to come up with a basic way to evaluate precision where concrete answers can be ascertained. Setting aside how valuable that is in an enterprise setting, it’s a good start.
But instead of creating the test itself, OpenAI would have been far more credible if it had funded a trusted third-party consulting or analyst firm to do the work, with a firm hands-off policy so IT could trust that the testing was not biased in favor of OpenAI’s offerings.
Still, something is better than nothing, so let’s look at what OpenAI said.
“SimpleQA is a simple, targeted evaluation for whether models ‘know what they know’ (and give) responses (that) are easy to grade because questions are created such that there exists only a single, indisputable answer. Each answer in SimpleQA is graded as either correct, incorrect, or not attempted. A model with ideal behavior would get as many questions correct as possible while not attempting the questions for which it is not confident it knows the correct answer.”
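To make that three-way grading concrete, here is a minimal sketch of how such grades could be tallied. It is an illustration only, not OpenAI’s scoring code: the function name summarize, the label strings and the sample data are all assumptions made up for this example.

```python
from collections import Counter

# The three grades described in the quote above. The exact label strings
# are an assumption for this sketch.
GRADES = ("correct", "incorrect", "not_attempted")


def summarize(grades):
    """Tally a list of per-question grades into simple summary rates."""
    counts = Counter(grades)
    unknown = set(counts) - set(GRADES)
    if unknown:
        raise ValueError(f"unexpected grade labels: {sorted(unknown)}")

    total = len(grades)
    attempted = counts["correct"] + counts["incorrect"]
    return {
        "total": total,
        "correct": counts["correct"],
        "incorrect": counts["incorrect"],
        "not_attempted": counts["not_attempted"],
        # Share of all questions answered correctly.
        "correct_overall": counts["correct"] / total if total else 0.0,
        # Share of attempted questions answered correctly. Declining to
        # answer does not count against the model here.
        "correct_when_attempted": (
            counts["correct"] / attempted if attempted else 0.0
        ),
    }


# Hypothetical run: 6 right, 2 wrong, 2 declined.
sample = ["correct"] * 6 + ["incorrect"] * 2 + ["not_attempted"] * 2
summary = summarize(sample)
print(summary)
```

In this made-up sample, the model is right on 60% of all questions but on 75% of the ones it chose to answer, which is the behavior the quote rewards: answer when confident, abstain when not.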
If you think through why this approach works (or seems like it would work), it becomes clear why it might not be helpful. The approach rests on a critically flawed assumption: that if the model can accurately answer these questions, it will likely be able to answer other questions with the same accuracy.
That might work with a calculator, but the nature of genAI hallucinations makes that assumption flawed. GenAI can easily get 10,000 questions correct and it might then wildly hallucinate for the next 50.
The nature of hallucinations is that they tend to happen randomly with zero predictability. That is why spot-checking, which is pretty much what SimpleQA is trying to do, won’t work here.
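A quick back-of-the-envelope calculation, using the hypothetical figures above (10,000 correct answers followed by 50 hallucinations), shows how little a headline score reveals. The variable names are illustrative only.

```python
# Hypothetical figures from the scenario above, not measured data.
correct_streak = 10_000
hallucinated = 50

total = correct_streak + hallucinated
overall_accuracy = correct_streak / total

# Accuracy over just the final 50 answers, all of which were wrong.
last_window_accuracy = 0 / hallucinated

print(f"Overall accuracy: {overall_accuracy:.2%}")            # 99.50%
print(f"Accuracy on the last 50: {last_window_accuracy:.2%}")  # 0.00%
```

A score of 99.5% looks superb on paper, yet it says nothing about which 50 answers were wrong or when the next bad stretch will arrive.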
To be more specific, it wouldn’t be meaningful if genAI tools were to get all of the SimpleQA answers right. But the reverse isn’t true. If the tested model gets all or most of the SimpleQA answers wrong, that does tell IT quite a bit. From the technology’s perspective, the test seems unfair. If it gets an A, it will be ignored. If it gets an F, it will be believed. As the computer said in WarGames (a great movie to watch to see what a genAI system might do at the Pentagon), “The only winning move is not to play.”
OpenAI pretty much concedes this in the report: “In this work, we will sidestep the open-endedness of language models by considering only short, fact-seeking questions with a single answer. This reduction of scope is important because it makes measuring factuality much more tractable, albeit at the cost of leaving open research questions such as whether improved behavior on short-form factuality generalizes to long-form factuality.”
Later in the report, OpenAI elaborates: “A main limitation with SimpleQA is that while it is accurate, it only measures factuality under the constrained setting of short, fact-seeking queries with a single, verifiable answer. Whether the ability to provide factual short answers correlates with the ability to write lengthy responses filled with numerous facts remains an open research question.”
Here are the specifics: SimpleQA consists of 4,326 “short, fact-seeking questions.”
Another component of the SimpleQA test is that the question-writer bears much of the responsibility, rather than the answer-writer. “One part of this criterion is that the question must specify the scope of the answer. For example, instead of asking ‘Where did Barack and Michelle Obama meet’ which could have multiple answers such as ‘Chicago’ or ‘the law firm Sidley & Austin,’ questions had to specify ‘which city’ or ‘which company.’ Another common example is that instead of asking simply ‘when,’ questions had to ask ‘what year’ or ‘what date.’”
That nicely articulates why this won’t likely be of use in the real world. Enterprise users are going to ask questions in an imprecise way. They have been sold on the promise of “just use natural language” and the system will figure out what you really mean through context. This test sidesteps that issue entirely.
So, how can the results be meaningful or reliable?
The very nature of hallucinations belies any way to quantify them. If they were predictable, IT could simply program its tools to ignore every 75th response. But they’re not. Until someone figures out how to truly eliminate hallucinations, the lack of reliable answers will stay with us.