This week, OpenAI released its o3 and o4-mini AI models. They are cutting-edge in many ways, but they are more likely to hallucinate — or confidently give answers that are not true — than their predecessors.
Image source: Mariia Shalabaieva / unsplash.com
Hallucination remains one of the biggest and most stubborn problems in AI, affecting even today's most capable systems. Historically, each successive model has hallucinated somewhat less than the one before it. That does not appear to be the case with o3 and o4-mini: according to OpenAI's own tests, the new systems hallucinate more often than the company's previous reasoning models (o1, o1-mini, and o3-mini), as well as traditional "non-reasoning" models like GPT-4o.
What’s a bit troubling is that the company itself doesn’t know why this is happening. In a technical report (PDF), it says “more research is needed” to understand why the frequency of hallucinations increases as reasoning models scale. OpenAI’s o3 and o4-mini perform better than their predecessors on a range of tasks, including math and programming, but because they “make more claims overall,” they also tend to make both “more accurate claims” and “more imprecise or hallucinatory claims,” the developer’s report says.
In OpenAI's own PersonQA test, which measures a model's knowledge about people, o3 hallucinated in 33% of answers, roughly double the rate of the previous reasoning models o1 and o3-mini (16% and 14.8%, respectively). The o4-mini model did even worse, hallucinating 48% of the time on the same test. A third-party evaluation by the independent research lab Transluce found that o3 tends to fabricate the actions it supposedly took while preparing its responses. In one case, the model claimed to have run code on a 2021 Apple MacBook Pro "outside of ChatGPT" and copied the resulting numbers into its answer; while o3 does have access to some tools, it could not actually have performed such an action.
One hypothesis is that the type of reinforcement learning used to train the "o" series may amplify hallucination problems that are normally reduced by standard post-training pipelines. As a result, experts believe, o3 may prove less useful in practice than its benchmark scores suggest: in programming tasks it significantly outperforms other models, yet it sometimes inserts broken website links into the code it generates.
One promising approach to reducing hallucinations is giving models access to web search. GPT-4o with web search enabled, for example, achieves roughly 90% accuracy on OpenAI's SimpleQA benchmark, and the same approach could plausibly help reasoning models as well. "Removing hallucinations across all of our models is an ongoing area of research, and we're constantly working to improve their accuracy and robustness," OpenAI told TechCrunch.
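To make the web-search idea concrete, here is a minimal sketch of how a developer might ask a model to ground a factual answer in retrieved sources. It assumes the OpenAI Python SDK's Responses API and its built-in web search tool; this illustrates the general technique rather than the exact setup OpenAI uses for benchmarks such as SimpleQA.

```python
# Minimal sketch: grounding a factual answer with web search.
# Assumes the OpenAI Python SDK's Responses API and its "web_search_preview"
# built-in tool; tool names and parameters may differ in other setups.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="gpt-4o",
    # Let the model retrieve current sources instead of relying purely on
    # what it memorized during training.
    tools=[{"type": "web_search_preview"}],
    input="When were the OpenAI o3 and o4-mini models released? Cite sources.",
)

# With the search tool enabled, the answer can include citations to the
# pages the model retrieved.
print(response.output_text)
```

The trade-off is that search-backed answers add an extra retrieval step and expose prompts to a search provider, which may not be acceptable for every application.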