NVIDIA once again posts leading results in the MLPerf Inference AI benchmark

NVIDIA reports that its platforms delivered the highest performance in every data center test of the MLPerf Inference v4.1 benchmark, the round in which the Blackwell family of accelerators made its debut.

The NVIDIA B200 accelerator (SXM, 180 GB HBM) delivered up to four times the performance of the H100 on MLPerf's largest large language model (LLM) workload, Llama 2 70B, thanks to the second-generation Transformer Engine and FP4 inference on its Tensor Cores. However, the B200 is also the one accelerator that customers may not get their hands on any time soon.

The NVIDIA H200 accelerator, now available in the CoreWeave cloud as well as in systems from ASUS, Dell, HPE, QCT and Supermicro, showed the best results in every test in the data center category, including the latest addition to the benchmark: the Mixtral 8x7B LLM, which has 46.7 billion parameters in total and 12.9 billion active parameters per token thanks to its Mixture of Experts (MoE) architecture.

Image source: NVIDIA

As NVIDIA notes, MoE has gained popularity as a way to make LLMs more versatile, allowing a single deployment to answer a wider range of questions and handle more diverse tasks. The architecture is also more efficient because only a few experts are activated per inference request, which means such models produce results much faster than dense models of similar size.
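
To make the routing idea concrete, here is a minimal, hypothetical sketch of a top-k MoE layer in PyTorch. The dimensions, expert count and gating scheme are illustrative assumptions and do not reproduce Mixtral's actual implementation; the point is simply that only a few experts are activated per token, so most of the model's parameters sit idle on any given step.

```python
# Illustrative top-k Mixture-of-Experts layer (assumed sizes, not Mixtral's real code).
import torch
import torch.nn as nn


class TinyMoELayer(nn.Module):
    def __init__(self, d_model: int = 64, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, n_experts)  # router / gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model); pick the top_k experts for each token
        scores = self.gate(x)                       # (tokens, n_experts)
        weights, chosen = scores.softmax(dim=-1).topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, slot] == e         # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out


tokens = torch.randn(16, 64)
print(TinyMoELayer()(tokens).shape)  # torch.Size([16, 64])
```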

NVIDIA also notes that as model sizes grow, combining several accelerators becomes essential to keep inference response times low. According to the company, NVLink and NVSwitch already provide significant advantages for cost-effective real-time LLM inference in the NVIDIA Hopper generation, and the Blackwell platform will extend NVLink further, allowing up to 72 accelerators to be linked together.
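
As an illustration of this kind of multi-accelerator deployment, the sketch below uses the open-source vLLM library to shard a large checkpoint across several NVLink-connected GPUs with tensor parallelism. The model name and GPU count are assumptions, and this is not NVIDIA's MLPerf submission code; it only shows the general pattern of splitting one large model across accelerators.

```python
# Hedged sketch: tensor-parallel serving of a 70B-class model with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",  # assumed checkpoint (gated, needs access)
    tensor_parallel_size=8,                  # shard the weights across 8 accelerators
)
params = SamplingParams(max_tokens=128, temperature=0.7)

for output in llm.generate(["Explain why MoE models can infer faster."], params):
    print(output.outputs[0].text)
```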

Image source: NVIDIA

At the same time, the company once again stressed the importance of its software ecosystem. In the latest MLPerf Inference round, all major NVIDIA platforms showed a sharp increase in performance: H200 accelerators, for example, delivered 27% higher generative AI inference performance than in the previous round, while Triton Inference Server achieved nearly the same performance as bare-metal deployments.

Finally, thanks to software optimizations in this MLPerf round, the NVIDIA Jetson AGX Orin platform achieved more than a 6.2x throughput improvement and a 2.5x latency improvement over the previous round on the GPT-J LLM workload. According to NVIDIA, Jetson can run any transformer model locally, including LLMs, Vision Transformer-class models and, for example, Stable Diffusion. Instead of developing highly specialized models, the general-purpose GPT-J-6B model can now be used for natural language processing at the edge.
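
As a rough illustration of running such a general-purpose model locally, the sketch below loads GPT-J-6B with the Hugging Face transformers library. NVIDIA's Jetson results rely on its own optimized inference stack rather than this plain-PyTorch path, so treat this only as a conceptual example of "one general model at the edge".

```python
# Minimal sketch: local GPT-J-6B inference with Hugging Face transformers
# (not NVIDIA's optimized Jetson stack; prompt and settings are illustrative).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/gpt-j-6B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"  # FP16 to fit 6B weights
)

inputs = tokenizer("Summarize this maintenance report:", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```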
