A five-month investigation by SemiAnalysis found that AMD's MI300X series of dedicated AI accelerators is not reaching its full potential due to serious software issues, undermining the company's efforts to mount real competition against Nvidia, which dominates the AI hardware market.
The study found that AMD's software is riddled with bugs that make training AI models nearly impossible without significant debugging. So while AMD struggles to bring the quality and ease of use of its accelerators up to par, Nvidia continues to widen the gap by rolling out new features and libraries and improving the performance of its solutions.
After extensive testing, including GEMM tests and single-node training, the researchers concluded that AMD is unable to overcome what they call the “impregnable CUDA moat” – the strong software advantage that Nvidia accelerators have.
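To give a sense of what such a GEMM test involves, here is a minimal sketch of an FP16 matrix-multiply throughput benchmark written in PyTorch. The matrix shapes, iteration counts, and the choice of PyTorch are illustrative assumptions, not details taken from the SemiAnalysis report, whose methodology is far more extensive.

```python
# Minimal sketch of an FP16 GEMM throughput test (illustrative only).
# Runs on both CUDA and ROCm builds of PyTorch, since torch.cuda maps to HIP on AMD GPUs.
import time
import torch

def gemm_tflops(m: int, n: int, k: int, iters: int = 50) -> float:
    """Measure achieved FP16 matmul throughput in teraflops."""
    a = torch.randn(m, k, dtype=torch.float16, device="cuda")
    b = torch.randn(k, n, dtype=torch.float16, device="cuda")
    # Warm-up so kernel selection and compilation are excluded from timing.
    for _ in range(10):
        torch.matmul(a, b)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        torch.matmul(a, b)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    flops = 2 * m * n * k * iters  # each GEMM costs 2*M*N*K floating-point operations
    return flops / elapsed / 1e12

if __name__ == "__main__":
    # One large square GEMM; real benchmark suites sweep many shapes.
    print(f"{gemm_tflops(8192, 8192, 8192):.1f} TFLOPS FP16")
```

The measured figure is then compared against the accelerator's theoretical peak to see how much of the advertised throughput real kernels actually deliver.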
On paper, the AMD MI300X looks impressive: 1,307 teraflops of FP16 compute and 192 GB of HBM3 memory. For comparison, Nvidia's H100 accelerator delivers 989 teraflops and only 80 GB of memory, although the new generation of Nvidia H200 accelerators, with configurations of up to 141 GB, is closing the memory gap. In addition, systems built on AMD accelerators offer a lower total cost of ownership thanks to lower system prices and cheaper networking infrastructure.
However, these advantages mean little in practice. According to SemiAnalysis, comparing bare specifications is like "comparing cameras by simply checking the megapixel count of one versus the other." AMD, the analysts say, is thus "just playing with numbers," while its solutions fail to deliver sufficient performance on real workloads.
The researchers note that they had to work directly with AMD engineers to fix numerous software bugs before they could obtain usable test results. Systems based on Nvidia accelerators, by contrast, worked out of the box with no additional tuning.
A particularly telling case for SemiAnalysis was TensorWave, the largest provider of AMD-GPU-based cloud services, which was forced to give AMD's engineering team free access to its GPUs (the same hardware TensorWave had purchased from AMD) just so AMD could troubleshoot its own software.
To solve the problems, SemiAnalysis recommends that AMD CEO Lisa Su invest far more heavily in software development and testing. Specifically, the analysts propose dedicating thousands of MI300X chips to automated testing (an approach Nvidia already follows for its own accelerators), simplifying the tangle of environment variables users currently have to juggle, and shipping better default settings for the accelerators. "Make the out-of-the-box experience usable!" they urge.
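For context, the sketch below shows the kind of environment-variable tuning an AMD GPU user may encounter today. The two variables shown are real ROCm/PyTorch knobs, but they are chosen here only as an assumption-laden illustration; the report does not specify which settings it has in mind.

```python
# Illustrative only: two real ROCm/PyTorch environment variables a user might
# set manually today. Which knobs SemiAnalysis is referring to is not stated
# in this article, so treat this selection as an assumption.
import os

# Restrict the job to specific AMD GPUs (the HIP analogue of CUDA_VISIBLE_DEVICES).
os.environ["HIP_VISIBLE_DEVICES"] = "0,1"

# Enable PyTorch's TunableOp so GEMM kernels are auto-tuned at runtime.
os.environ["PYTORCH_TUNABLEOP_ENABLED"] = "1"

import torch  # imported after the variables are set so they take effect

assert torch.cuda.is_available()
```

The analysts' point is that this burden should fall on AMD's defaults rather than on every end user.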
SemiAnalysis says in its report that it wishes AMD success in competing with Nvidia, but notes that "unfortunately, much remains to be done" to get there. Without significant software improvements, AMD risks falling further behind as Nvidia prepares the mass rollout of its next-generation Blackwell accelerators, although, according to reports, that process is not going entirely smoothly for Nvidia either.