All Modern AIs Fail a New Complex General Intelligence Test, and Even Humans Fall Short of Perfect

A new test for assessing the general intelligence of AI models, called ARC-AGI-2, has stumped most AI models. According to the published leaderboard, reasoning models such as OpenAI’s o1-pro and DeepSeek’s R1 scored between 1% and 1.3%. Non-reasoning models, including GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Flash, scored less than 1%.


The Arc Prize Foundation, a non-profit organization co-founded by renowned AI researcher François Chollet, announced on its blog that it has created a new, more advanced test to measure the general intelligence of leading AI models.

The ARC-AGI-2 test is a series of puzzles in which the AI must recognize visual patterns by analyzing grids of colored squares and, based on this, construct the correct continuation of the pattern. The test is specifically designed so that the models cannot rely on past experience and are forced to adapt to new tasks.
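To make the puzzle format concrete, here is a minimal toy sketch of how an ARC-style task can be represented and checked. The specific grids and the `mirror_lr` rule are hypothetical examples, not an actual ARC-AGI-2 task; ARC tasks are distributed as JSON files of small-integer grids, and a solver is scored by exact match on the predicted output grid.

```python
# Toy illustration of an ARC-style task (not the official harness):
# each task gives a few input -> output grid pairs, and the solver must
# infer the transformation and apply it to a fresh test input.
# Grids are lists of lists of small integers standing for colors.

train_pairs = [
    # hypothetical task whose hidden rule is "mirror the grid left-to-right"
    ([[1, 0], [2, 0]], [[0, 1], [0, 2]]),
    ([[3, 4], [0, 0]], [[4, 3], [0, 0]]),
]
test_input = [[5, 0], [0, 6]]

def mirror_lr(grid):
    """Candidate rule: reverse each row (mirror left-to-right)."""
    return [list(reversed(row)) for row in grid]

# Scoring is exact-match on the output grid, so a candidate rule is only
# trusted if it reproduces every training pair first.
assert all(mirror_lr(inp) == out for inp, out in train_pairs)

print(mirror_lr(test_input))  # -> [[0, 5], [6, 0]]
```

The point of the benchmark's design is that each task hides a different rule, so a solver cannot memorize transformations; it has to infer the rule from the handful of training pairs, as the assertion above does for this one toy case.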

The Arc Prize Foundation also conducted testing with over 400 people. On average, the groups of subjects answered 60% of the questions correctly. This significantly exceeds the performance of all the AI tested, and at the same time highlights the gap between current AI capabilities and human intelligence in solving problems that require adaptation and understanding of new concepts.

Chollet said that ARC-AGI-2 is a more accurate measure of the actual intelligence of AI models than the previous version of the test, ARC-AGI-1. In addition, ARC-AGI-2 eliminates the possibility of solving problems by “brute force” — that is, by using huge computing power to try all possible options, which occurred in the ARC-AGI-1 test and was considered a serious flaw.

To address the first test’s shortcomings, ARC-AGI-2 introduced an efficiency metric that forced the AI to interpret patterns on the fly rather than rely on memorization. Arc Prize Foundation co-founder Greg Kamradt noted that “intelligence is not just about the ability to solve problems or achieve high results, but also about the efficiency with which those capabilities are acquired and deployed.”

ARC-AGI-1 remained the leading metric for about five years until OpenAI released its advanced reasoning model o3 in December 2024. This model outperformed all other AI models and even matched human performance on ARC-AGI-1 benchmarks. However, as noted, these gains came at the cost of significant computational effort.

The development of the new benchmark comes as concerns in the industry grow over the lack of objective benchmarks for evaluating artificial intelligence. In response, the Arc Prize Foundation has announced the Arc Prize 2025, which challenges developers to achieve 85% accuracy on ARC-AGI-2 while spending no more than $0.42 per task.
