A new study from ETH Zurich and INSAIT shows that modern AI models that simulate reasoning, while reliably solving standard mathematical problems, are almost incapable of producing complete proofs at the level of the 2025 United States of America Mathematical Olympiad (USAMO). The results call into question the ability of current AI models to perform deep mathematical reasoning.
Image source: Imkara Visual / Unsplash
In March 2025, a research team from ETH Zurich and the Institute of Computer Science, Artificial Intelligence and Technology (INSAIT) at Sofia University, led by Ivo Petrov and Martin Vechev, published a preprint titled “Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad”. The work evaluates the ability of large language models (LLMs) that simulate reasoning to generate complete mathematical proofs for Olympiad-level problems.
The analysis used the six problems of the 2025 USAMO, organized by the Mathematical Association of America. The AI models were tested immediately after the problems were published, to minimize the risk that the problems had leaked into training data. Each solution was scored on a scale from 0 to 7 points per problem, with partial credit awarded by expert graders. Averaged across all models, performance on full-proof generation came to less than 5% of the maximum possible score. Only one model, Google’s Gemini 2.5 Pro, did noticeably better, scoring 10.1 out of 42 possible points, or about 24%. The other models lagged far behind: DeepSeek R1 and Grok 3 scored 2.0 points each, Gemini Flash Thinking 1.8, Claude 3.7 Sonnet 1.5, and Qwen QwQ and OpenAI o1-pro 1.2 each, while OpenAI’s o3-mini-high scored just 0.9. Of the nearly 200 solutions generated, not one received a perfect score.
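For context, these percentages follow directly from the scoring scheme described above: six problems at up to 7 points each give a maximum of 6 × 7 = 42 points, so Gemini 2.5 Pro’s 10.1 points correspond to 10.1 / 42 ≈ 24%, while a score of 2.0 points, the best among the remaining models, works out to roughly 4.8%.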
The study highlights a fundamental difference between problem solving and constructing mathematical proofs. Standard problems, such as computing the value of an expression or finding an unknown, require only a single final answer. Proofs, by contrast, require a coherent chain of logical argument establishing that a statement holds in all possible cases. This qualitative difference makes USAMO-level problems far more demanding in terms of depth of reasoning.
Screenshot of USAMO 2025 Problem 1 and its solution on AoPSOnline. Image source: AoPSOnline
The authors of the study identified characteristic error patterns in the AI models’ work. One of them was an inability to maintain correct logical connections throughout an entire chain of inference. In 2025 USAMO Problem 5, for example, the models had to find all natural values of k for which a certain expression involving binomial coefficients raised to the power k remains an integer for every positive integer n. The Qwen QwQ model made a gross error by ruling out non-integer values that the problem’s conditions allow, which led to an incorrect final conclusion even though the conditions had been stated correctly at intermediate steps.
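For illustration, the sketch below brute-forces small cases of a quantity of this kind. It assumes the commonly circulated formulation of Problem 5, namely that the sum of C(n, i)^k over i = 0..n, divided by n + 1, must be an integer for every positive integer n; treat that exact form as an assumption here rather than the official wording. A finite check of this sort can suggest which k are plausible candidates, but, as the study stresses, it is nothing like a proof.

```python
from math import comb

def is_integral(k: int, n: int) -> bool:
    """Check whether sum_{i=0..n} C(n, i)**k is divisible by n + 1.

    Assumes the commonly circulated formulation of USAMO 2025 Problem 5;
    the exact statement should be checked against the official problem text.
    """
    total = sum(comb(n, i) ** k for i in range(n + 1))
    return total % (n + 1) == 0

for k in range(1, 7):
    # Look for small counterexamples; their absence is evidence, not a proof.
    failures = [n for n in range(1, 41) if not is_integral(k, n)]
    verdict = f"first failing n = {failures[0]}" if failures else "no failures for n <= 40"
    print(f"k = {k}: {verdict}")
```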
A characteristic feature of the models’ behavior was that, even when they made serious logical errors, they presented their solutions assertively, with no sign of doubt and no acknowledgment of possible contradictions. This aspect of simulated reasoning points to the absence of internal self-checking and error-correction mechanisms in these AI models.
Accuracy of the AI models on each USAMO 2025 problem. Image source: MathArena
Despite the identified limitations, chain-of-thought prompting and simulated-reasoning techniques do have a positive effect on the formation of intermediate logical steps in the models’ inference. Scaling up computation at inference time lets the models build more coherent local reasoning. The fundamental problem remains, however: modern large language models (LLMs) built on the Transformer architecture still operate as pattern-recognition systems rather than as systems capable of independent conceptual reasoning.
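As a rough illustration of what scaling computation at inference time can mean in practice, the sketch below samples several independent reasoning chains and keeps the majority final answer, a self-consistency style scheme. The generate_chain function is a hypothetical stand-in for any LLM call and does not reflect the specific setup used in the study.

```python
from collections import Counter
from typing import Callable, Tuple

def self_consistent_answer(
    generate_chain: Callable[[str], Tuple[str, str]],
    prompt: str,
    n_samples: int = 8,
) -> str:
    """Sample several reasoning chains and return the most common final answer.

    generate_chain is a hypothetical stand-in for an LLM call that returns
    (reasoning_text, final_answer); more samples means more inference-time compute.
    """
    answers = []
    for _ in range(n_samples):
        _reasoning, answer = generate_chain(prompt)
        answers.append(answer)
    # Majority vote over final answers: this helps on short-answer problems,
    # but it cannot by itself verify that a generated proof is logically sound.
    return Counter(answers).most_common(1)[0][0]
```

Majority voting of this kind raises accuracy on answer-based benchmarks, but, as the study’s results suggest, it does not address the harder task of producing a logically sound proof.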
The improved performance of Gemini 2.5 Pro suggests that the gap between simulated and genuine reasoning could narrow in the future. Achieving a qualitative leap, however, will require training AI models to capture deeper, multidimensional relationships in latent space and to master the construction of new logical structures, rather than merely reproducing patterns from their training data.