The MLCommons consortium has published the results of its latest MLPerf machine-learning (ML) benchmark round, as reported by IEEE Spectrum. NVIDIA's Blackwell-architecture accelerators outperformed all other chips, while AMD's newest Instinct accelerator, the Instinct MI325X, proved to be on par with NVIDIA's competing H200. The comparable results came mainly in tests of a relatively small large language model (LLM), Llama2 70B. To better reflect how ML development is evolving, the consortium also added three new MLPerf tests.
MLCommons created the MLPerf benchmark to provide an apples-to-apples comparison of ML systems. Submitters use their own software and hardware, but the underlying neural networks must be the same. There are currently 11 server benchmarks, three of which were added this year.
Image source: IEEE Spectrum
Miro Hodak, co-chair of MLPerf Inference, noted that the AI industry is evolving rapidly, and to keep up, the consortium has had to “accelerate the pace of introducing new benchmarks into the space.”
Two of the new benchmarks are LLM-based. The popular and relatively compact Llama2 70B is already an established MLPerf benchmark, but the consortium decided to add a test that simulates the responsiveness users expect from chatbots. The new benchmark, Llama2-70B Interactive, tightens the hardware requirements: systems must deliver at least 25 tokens per second with a response latency of no more than 450 ms.
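The two targets cited above can be expressed as a simple pass/fail check. This is an illustrative sketch using the article's figures (25 tokens/s, 450 ms), not the full MLPerf rules, which define measurement methodology in much more detail; the function name and parameters are invented for the example.

```python
# Sketch: does a measured serving run meet the Llama2-70B Interactive
# targets quoted in the article? (Thresholds are illustrative; the official
# MLPerf Inference rules specify how these are actually measured.)

def meets_interactive_targets(tokens_generated: int,
                              generation_seconds: float,
                              first_token_latency_ms: float) -> bool:
    """Return True if both throughput and responsiveness meet the bar."""
    tokens_per_second = tokens_generated / generation_seconds
    return tokens_per_second >= 25.0 and first_token_latency_ms <= 450.0

# Example: 1,000 tokens in 30 s (~33 tok/s) with a 400 ms first response
print(meets_interactive_targets(1000, 30.0, 400.0))  # True
print(meets_interactive_targets(1000, 50.0, 400.0))  # False: only 20 tok/s
```

The latency bound matters as much as raw throughput here: a system that streams tokens quickly but takes a full second to start answering would still fail the interactive test.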
Given the growing popularity of “agent AI,” MLPerf decided to add an LLM test with the characteristics such tasks require. Llama3.1 405B was chosen. The model has a large context window of 128,000 tokens, roughly 30 times that of Llama2 70B.
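The "30 times" figure checks out against Llama 2's published context window of 4,096 tokens:

```python
# Llama 2's context window is 4,096 tokens; Llama 3.1 supports 128,000.
llama2_context = 4_096
llama31_context = 128_000
print(llama31_context / llama2_context)  # 31.25, i.e. roughly 30x
```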
The third new benchmark, RGAT, tests a graph attention network, which classifies information within a graph. The RGAT test dataset, for example, consists of research articles linked by authors, institutions, and research areas, amounting to 2 TB of data. RGAT must classify the articles into nearly 3,000 topics.
This time, NVIDIA and 15 partner companies, including Dell, Google, and Supermicro, submitted results. Both of NVIDIA’s Hopper accelerators, the H100 and H200, performed well. “We’ve been able to add another 60% performance over the last year,” says Dave Salvator, director of accelerated computing products at NVIDIA, of Hopper, which launched in 2022. “It still has some performance left.” The Blackwell-based B200, however, emerged as the leader. “The only thing faster than Hopper is Blackwell,” says Salvator. The B200 has 36% more HBM memory than the H200, but more importantly, it can perform key machine-learning math operations at just 4-bit precision, versus Hopper’s 8-bit. Lower-precision compute units are smaller, so more of them fit on the GPU, allowing for faster AI computations.
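The intuition behind the precision trade-off can be shown with a toy example. The sketch below uses simple symmetric *integer* quantization as a stand-in (Blackwell's actual FP4 is a floating-point format, and the function here is invented for illustration): halving the bit width halves storage and shrinks the multiply units, at the cost of a much coarser value grid.

```python
# Illustrative sketch: mapping the same weights onto an 8-bit vs a 4-bit
# signed integer grid. Not NVIDIA's actual FP4 format (which is
# floating-point) -- just a demonstration of the precision/size trade-off.

def quantize(values, bits):
    """Symmetric quantization onto a signed integer grid of `bits` width."""
    qmax = 2 ** (bits - 1) - 1               # 127 for 8-bit, 7 for 4-bit
    scale = max(abs(v) for v in values) / qmax
    return [round(v / scale) for v in values], scale

weights = [0.91, -0.42, 0.07, -0.88]
q8, s8 = quantize(weights, 8)
q4, s4 = quantize(weights, 4)
print(q8)  # [127, -59, 10, -123] -- fine-grained 8-bit grid
print(q4)  # [7, -3, 1, -7]       -- same weights on the coarse 4-bit grid
```

The 4-bit representation costs half the memory and bandwidth per weight, which is why lower-precision math units both fit more densely on a die and move data faster.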
In the Llama3.1 405B test, Supermicro’s eight-B200 system delivered nearly four times as many tokens per second as Cisco’s eight-H200 system. And the same Supermicro system was three times faster than the fastest H200 machine in the interactive version of Llama2 70B.
NVIDIA used the GB200 superchip — a combination of Blackwell accelerators and Grace processors — to demonstrate how its NVL72 data paths can integrate multiple servers into a rack, acting as one giant GPU. In an unverified result the company shared with reporters, a full rack of GB200 NVL72-powered machines delivered 869,200 tokens per second in Llama2 70B. The fastest system in the current MLPerf round, an NVIDIA B200 server, delivered 98,443 tokens per second.
AMD positions the Instinct MI325X accelerator as a competitor to the H200. It shares the architecture of its predecessor, the MI300, but carries more HBM memory at higher bandwidth: 256 GB and 6 TB/s (increases of 33% and 13%, respectively). AMD also optimized its software, boosting DeepSeek-R1 inference speed eightfold.
In the Llama2 70B test, computers with eight MI325Xs lagged behind similar H200-based systems by only 3-7%. In image generation tasks, the MI325X system performed within 10% of the H200 system.
AMD partner MangoBoost reportedly demonstrated a nearly fourfold performance increase in the Llama2 70B test by running the workload across four computers.
Intel traditionally uses CPU-only systems in its benchmarks to show that some workloads don’t require GPUs. This time, it presented the first data from Intel Xeon 6 chips (codenamed Granite Rapids), manufactured on the Intel 3 process. A system with two Xeon 6s scored 40,285 samples per second, about one-third the performance of a Cisco system with two NVIDIA H100s.
Compared to the 5th-generation Xeon results from October 2024, the new processor shows an 80% gain in this test, and even larger speedups in object detection and medical imaging tasks. Since 2021, when Intel began submitting Xeon results (with the 3rd-generation Xeon), its processors have achieved an 11x performance gain in the ResNet test.
Intel has, however, dropped out of the accelerator category: its H100 competitor, the Gaudi 3, appears neither in the current MLPerf results nor in version 4.1, released in October 2024.
Google’s TPU v6e chip also showed off its capabilities, though the results were limited to the image generation task. At 5.48 queries per second, the four-TPU system showed a 2.5x performance gain over a similar computer using the TPU v5e in October 2024 results. Still, 5.48 queries per second is about the same as a similarly sized Lenovo computer with an NVIDIA H100.