Meta has denied rumors that the company artificially boosted the benchmark performance of its new Llama 4 AI models. Ahmad Al-Dahle, the company's vice president of generative AI, said in a post on X that claims Meta rigged the results to hide weaknesses in the Maverick and Scout models are “simply not true.”
Image source: Mariia shalabeeeva / Unsplash
Rumors of manipulation surfaced after a post attributed to a former Meta employee appeared on a Chinese social media platform. The user claimed to have quit the company in protest against “unfair testing methods.” The accusations later spread to X (formerly Twitter) and Reddit, TechCrunch writes.
However, Al-Dahle stressed that Meta did not train Llama 4 Maverick and Llama 4 Scout on “test sets,” the held-out data used to evaluate AI models. Training on such data would artificially inflate benchmark scores, creating a false impression of a model’s capabilities.
The suspicions initially arose because Llama 4 Maverick performed differently across platforms. Researchers noticed that the version of the model evaluated in the LM Arena benchmark behaved differently from the publicly available one and failed at certain tasks. Meta’s use of an experimental build of Maverick to improve its benchmark results also raised questions.
At the same time, Al-Dahle suggested that the unstable quality users are currently experiencing may be related to how the cloud providers hosting the models have configured them. “We released the models as soon as they were ready, and it will take a few days for all public implementations to be tuned to our requirements,” he explained. In any case, Meta has promised to keep fixing Llama 4 bugs so that developers can quickly integrate the models into their projects.