Last December, OpenAI unveiled its large language model o3, claiming it could solve more than 25% of FrontierMath, a benchmark of exceptionally difficult math problems, while no other AI model scored above 2%. However, discrepancies between internal and independent tests have raised questions about the company’s transparency and testing practices.
Image source: Levart_Photographer / unsplash.com
When o3 was announced, a company representative highlighted the model’s performance on FrontierMath problems. However, the consumer version released last week performs nowhere near as well on the benchmark. This may mean that OpenAI either overstated the original results or evaluated a different, more math-capable version of o3.
Epoch AI, the team behind FrontierMath, has published independent tests of the publicly available version of o3. It found that the model solved only about 10% of the problems, well below OpenAI’s claimed 25%. The researchers also tested o4-mini, a smaller, cheaper model that succeeds o3-mini.
Image source: @EpochAIResearch / X
Of course, the discrepancy does not necessarily mean that OpenAI intentionally inflated the model’s performance. The lower bound of OpenAI’s December results is almost identical to the score obtained by Epoch AI. Epoch AI also noted that the model it tested is likely different from the one OpenAI evaluated, and that the researchers used an updated version of the FrontierMath problem set.
“The difference between our results and OpenAI’s could be due to OpenAI evaluating the results using a more powerful internal version, using more computation time, or because these results were obtained on a different subset of FrontierMath (180 problems in frontiermath-2024-11-26 vs. 290 problems in frontiermath-2025-02-28),” Epoch AI said in a statement.
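For context on why scores from different problem sets are hard to compare directly: a benchmark score is simply the share of problems solved, so the same model can post different percentages on subsets that differ in size and difficulty. The minimal sketch below illustrates the arithmetic; the solve counts are hypothetical figures chosen to match the reported percentages, not data from Epoch AI or OpenAI.

```python
# Illustration: a benchmark score is solved/total, so percentages from
# different FrontierMath snapshots are not directly comparable.
# All solve counts below are hypothetical, for illustration only.

def pass_rate(solved: int, total: int) -> float:
    """Share of problems solved, as a percentage."""
    return 100.0 * solved / total

# Hypothetical runs on the two differently sized problem subsets.
subsets = {
    "frontiermath-2024-11-26": (45, 180),   # 45 of 180 solved -> 25.0%
    "frontiermath-2025-02-28": (29, 290),   # 29 of 290 solved -> 10.0%
}

for name, (solved, total) in subsets.items():
    print(f"{name}: {solved}/{total} = {pass_rate(solved, total):.1f}%")
```

Because the two snapshots differ in both size and problem mix, a gap of this kind can arise even without any change to the model itself.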
According to the ARC Prize Foundation, which tested a preview version of o3, the public release “is a different model” that is optimized for chat and product use. “All released versions of o3 are lower in computational power than the version we tested,” the foundation said.
OpenAI’s Wenda Zhou said the public version of o3 is “more optimized for real-world use cases” and responds faster than the version the company demonstrated in December. Zhou said that is why benchmark results may differ from what OpenAI previously showed.