A new study from Microsoft Research found that while AI can help developers write code, even the best models from OpenAI (o1) and Anthropic (Claude 3.7 Sonnet) fix bugs less than half the time. The testing was based on SWE-bench, a widely used benchmark that measures how well AI systems resolve real-world software issues.

During the experiment, AI agents attempted 300 debugging tasks. Claude 3.7 Sonnet led with a success rate of 48.4%, followed by OpenAI o1 (30.2%) and o3-mini (22.1%). Even the best of these figures falls well short of what would be expected from experienced human programmers. As TechCrunch explains, part of the problem is that the models still struggle to use the debugging tools available to them and to interpret errors.

According to the authors of the study, the key obstacle remains the lack of data for training the models. “We strongly believe that training or retraining can make them better interactive debuggers,” they write. “However, this requires specialized data, such as a chain of records of all human interactions with AI debuggers.”

Currently, such data is scarce, which limits the models’ capabilities. Other AI coding tools show similar weaknesses: in one evaluation, the popular Devin tool from startup Cognition Labs completed only three out of 20 coding tests. AI is nevertheless actively used by companies like Google, where, according to CEO Sundar Pichai, about a quarter of new code is generated by AI, even though such code can introduce errors.

Tech leaders are skeptical that programming can be fully automated. Bill Gates is confident that programming as a profession is not going away. Replit CEO Amjad Masad, Okta CEO Todd McKinnon, and IBM CEO Arvind Krishna share a similar opinion.

Despite these problems, interest in AI development tools continues to grow. Investors see potential for efficiency gains, but leading developers believe it’s too early to fully trust AI.
