NVIDIA reported that its platforms delivered the top results in every data center test of the MLPerf Inference v4.1 benchmark, which also marked the debut of the Blackwell family of accelerators.
The NVIDIA B200 accelerator (SXM, 180 GB HBM) proved up to four times faster than the H100 on the largest large language model (LLM) workload in MLPerf, Llama 2 70B, thanks to its second-generation Transformer Engine and FP4 inference on Tensor cores. However, it is precisely the B200 that customers may end up waiting for the longest.
The NVIDIA H200 accelerator, which became available in the CoreWeave cloud as well as in systems from ASUS, Dell, HPE, QCT and Supermicro, showed the best results in all tests in the data center category, including the latest addition to the benchmark, the Mixtral 8x7B LLM, which has 46.7 billion total parameters and 12.9 billion active parameters per token and uses the Mixture of Experts (MoE) architecture.
As NVIDIA noted, MoE has gained popularity as a way to make LLMs more versatile, allowing a single deployment to answer a wider range of questions and perform more diverse tasks. The architecture is also more efficient because only a few experts are activated per inference, which means such models produce results much faster than dense models of a similar size.
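To make the efficiency argument concrete, below is a minimal, illustrative sketch of top-k expert routing in PyTorch. It is not Mixtral's or NVIDIA's implementation; the layer sizes, the `SimpleMoELayer` name and the routing loop are assumptions chosen purely to show why only a fraction of the model's weights is exercised for each token.

```python
# Illustrative MoE routing sketch (hypothetical code, not Mixtral/NVIDIA's).
# A gating network scores all experts per token, but only the top-k experts
# are actually evaluated, which is why MoE inference is cheaper than a dense
# model of comparable total size.
import torch
import torch.nn as nn


class SimpleMoELayer(nn.Module):
    def __init__(self, d_model=512, n_experts=8, top_k=2):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)  # router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                          # x: (tokens, d_model)
        scores = self.gate(x)                      # (tokens, n_experts)
        weights, idx = torch.topk(scores, self.top_k, dim=-1)
        weights = torch.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        # Only the selected experts run for each token.
        for k in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * self.experts[e](x[mask])
        return out


# Example: 4 tokens pass through the layer; only 2 of 8 experts run per token.
layer = SimpleMoELayer()
print(layer(torch.randn(4, 512)).shape)            # torch.Size([4, 512])
```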
NVIDIA also notes that as models grow larger, combining several accelerators becomes mandatory to keep inference response times low. According to the company, NVLink and NVSwitch already provide significant advantages for cost-effective real-time LLM inference in the NVIDIA Hopper generation, and the Blackwell platform will extend NVLink further, allowing up to 72 accelerators to be combined.
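As an illustration of what splitting one model across several NVLink-connected accelerators looks like in practice, here is a minimal sketch using the open-source vLLM library. vLLM is not mentioned in the article and this is not the MLPerf submission code; it simply shows tensor parallelism sharding the Llama 2 70B weights across eight GPUs on one node.

```python
# Minimal sketch (assumptions: the open-source vLLM library and a single node
# with 8 interconnected GPUs; not NVIDIA's MLPerf software stack).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",  # the MLPerf LLM workload model
    tensor_parallel_size=8,                  # shard each layer across 8 GPUs
    dtype="float16",
)

params = SamplingParams(max_tokens=128, temperature=0.0)
outputs = llm.generate(["Explain what NVLink does in one sentence."], params)
print(outputs[0].outputs[0].text)
```

With tensor parallelism, every layer's weights are split across the accelerators, so the per-token activations travel over NVLink on each forward pass; fast interconnects are what keep that exchange from dominating latency.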
At the same time, the company once again emphasized the importance of its software ecosystem. In the latest round of MLPerf Inference, all major NVIDIA platforms demonstrated sharp performance gains: NVIDIA H200 accelerators, for example, delivered a 27% increase in generative AI inference performance compared with the previous round, and Triton Inference Server achieved almost the same performance as bare-metal deployments.
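For context, serving a model behind Triton Inference Server looks roughly like the client-side sketch below, using NVIDIA's official tritonclient package. The model name ("gptj") and tensor names ("input_ids", "output_ids") are placeholders that depend on the deployed model configuration; they are assumptions, not the MLPerf setup.

```python
# Minimal Triton Inference Server client sketch (model and tensor names are
# hypothetical and must match the deployed model's config.pbtxt).
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# A toy batch of token IDs; a real deployment would produce these with a tokenizer.
token_ids = np.array([[101, 2023, 2003, 1037, 3231, 102]], dtype=np.int32)

inp = httpclient.InferInput("input_ids", token_ids.shape, "INT32")
inp.set_data_from_numpy(token_ids)
out = httpclient.InferRequestedOutput("output_ids")

result = client.infer(model_name="gptj", inputs=[inp], outputs=[out])
print(result.as_numpy("output_ids"))
```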
Finally, thanks to software optimizations in this MLPerf round, the NVIDIA Jetson AGX Orin platform achieved more than a 6.2x improvement in throughput and a 2.5x improvement in latency over the previous round on the GPT-J LLM workload. According to NVIDIA, Jetson can process any transformer model locally, including LLMs, Vision Transformer class models and, for example, Stable Diffusion. Instead of developing highly specialized models, a single general-purpose GPT-J-6B model can now handle natural language processing at the edge.
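For readers who want to see what "one general-purpose model at the edge" means in code, here is a minimal sketch that loads GPT-J-6B with Hugging Face Transformers and generates text locally. It only illustrates the idea; the Jetson MLPerf results rely on NVIDIA's own optimized inference stack, not this plain PyTorch path, and the prompt is purely hypothetical.

```python
# Minimal local GPT-J-6B sketch (illustrative only; not the optimized
# Jetson/MLPerf pipeline). FP16 weights keep the 6B model within device memory.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",
    torch_dtype=torch.float16,
).to("cuda")

prompt = "Summarize the sensor log in one sentence:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```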