Comparing AI models is notoriously difficult, and their creators are often accused of bias and of presenting test results in ways that ordinary people find hard to understand. So rather than relying on abstract math and logic tests, a group of researchers proposed testing AI models on Nintendo’s classic platformer Super Mario Bros.

Image source: Hao AI Lab

The experiment used an emulated version of Super Mario Bros. that was integrated with a custom framework called GamingAgent from researchers at the Hao AI Lab at the University of California, San Diego. This system allowed AI models to control Mario by generating Python code. All models were given the same basic instructions, like “Jump over this enemy,” as well as visualizations of the game state in the form of screenshots.
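To make the "control Mario by generating Python code" idea concrete, here is a minimal sketch of one step of such a loop. It is not GamingAgent's actual code: `FakeController`, `stub_model`, and `run_step` are hypothetical names invented for illustration, and the model call is replaced by a stub that returns a fixed snippet.

```python
from dataclasses import dataclass, field


@dataclass
class FakeController:
    """Hypothetical stand-in for an emulator input layer; it only records presses."""
    log: list = field(default_factory=list)

    def press(self, button: str, seconds: float) -> None:
        self.log.append((button, seconds))


def stub_model(instruction: str, screenshot: bytes) -> str:
    """Hypothetical model call that returns Python source as text.

    A real setup would send the instruction and screenshot to an LLM API;
    here the reply is hard-coded so the example runs offline.
    """
    return "controller.press('right', 0.5)\ncontroller.press('jump', 0.3)\n"


def run_step(controller: FakeController, instruction: str, screenshot: bytes) -> None:
    code = stub_model(instruction, screenshot)
    # Run the generated code with only the controller in scope.
    # Emptying __builtins__ limits what the code can reach, but it is NOT a real sandbox.
    exec(code, {"__builtins__": {}}, {"controller": controller})


controller = FakeController()
run_step(controller, "Jump over this enemy", screenshot=b"")
print(controller.log)  # [('right', 0.5), ('jump', 0.3)]
```

In a real agent, the recorded presses would be forwarded to the emulator, and the loop would repeat with a fresh screenshot after each step.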

While Super Mario Bros. may look like a simple 2D platformer, researchers have found that the classic Nintendo game seriously challenges AI to plan complex movement sequences and adapt gameplay strategies on the fly.

The researchers named Anthropic’s Claude 3.7 the best model at Super Mario Bros.: it demonstrated impressive reflexes, stringing together precise jumps and skillfully avoiding enemies. Its predecessor, Claude 3.5, also performed well, while OpenAI’s GPT-4o and Google’s Gemini 1.5 Pro lagged behind.

As it turns out, logical thinking isn’t the key to success in Super Mario Bros.; timing is. Even a small delay can send Mario back to a previous checkpoint. The researchers suggest that the more “deliberate” reasoning models may have taken too long to decide on their next steps, leading to frequent failures.
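A rough back-of-the-envelope calculation shows why response time matters. The sketch below assumes the game keeps running while the model decides and that it runs at roughly 60 frames per second, as on an NTSC NES. The latencies are illustrative values, not measurements from the experiment.

```python
NES_FPS = 60  # approximate NTSC frame rate; an assumption for this sketch


def frames_elapsed(latency_seconds: float, fps: int = NES_FPS) -> int:
    """Number of game frames that pass while the model is deciding."""
    return round(latency_seconds * fps)


# Illustrative latencies only; they are not taken from the study.
for label, latency in [("quick reply", 0.2), ("slow reply", 3.0)]:
    print(f"{label}: {latency:.1f} s = {frames_elapsed(latency)} frames")
```

Under these assumptions, a model that takes three seconds to answer is acting on a screenshot that is 180 frames old, by which time an enemy or a gap may already have reached Mario.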

Of course, using retro games to evaluate AI is largely an experiment. An AI’s ability to beat Super Mario Bros. doesn’t determine how useful it really is, though watching models with billions of parameters battle (and often lose to) a seemingly simple children’s game is certainly entertaining.

For those who want to run their own experiment, Hao AI Lab has published the GamingAgent source code on GitHub.
