NVIDIA once again posts leading results in the MLPerf Inference AI benchmark

NVIDIA reported that its platforms delivered the highest performance in every data center test of the MLPerf Inference v4.1 benchmark, in which the Blackwell family of accelerators made its debut.

The NVIDIA B200 accelerator (SXM, 180 GB HBM) proved up to four times faster than the H100 on MLPerf's largest large language model (LLM) workload, Llama 2 70B, thanks to the second-generation Transformer Engine and FP4 inference on its Tensor Cores. However, the B200 is precisely the accelerator that customers may not get their hands on any time soon.

The NVIDIA H200 accelerator, now available in the CoreWeave cloud as well as in systems from ASUS, Dell, HPE, QCT and Supermicro, showed the best results in every data center test, including the benchmark's latest addition, the Mixtral 8x7B LLM, which uses a Mixture of Experts (MoE) architecture with 46.7 billion parameters in total and 12.9 billion active parameters per token.

Image source: NVIDIA

As NVIDIA noted, MoE has gained popularity as a way to make LLMs more versatile, allowing a single deployment to answer a wider range of questions and handle more diverse tasks. The architecture is also more efficient because only a few experts are activated for each inference, so such models produce results much faster than dense models of comparable size.
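To illustrate the idea of activating only a few experts per token, below is a minimal sketch of top-k MoE routing in PyTorch. It is not NVIDIA's or Mistral's implementation, and the dimensions, expert count and top_k value are arbitrary placeholders; the point is only that the router selects a small subset of experts, so most of the model's weights sit idle on any given forward pass.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Toy Mixture-of-Experts layer: route each token to its top-k experts."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # scores every expert for each token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                             # x: (tokens, d_model)
        scores = self.router(x)                       # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # keep only the top-k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):                   # dispatch tokens to their chosen experts
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out

tokens = torch.randn(4, 512)
print(SparseMoE()(tokens).shape)                      # torch.Size([4, 512])
```

In a dense model of the same total size, every token would pass through the full feed-forward weights; here each token only touches top_k of the n_experts expert blocks, which is where the speed advantage described above comes from.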

NVIDIA also notes that as model sizes grow, combining several accelerators becomes essential to keep inference response times low. According to the company, NVLink and NVSwitch already give the NVIDIA Hopper generation a significant advantage for cost-effective real-time LLM inference, and the Blackwell platform will extend NVLink further, allowing up to 72 accelerators to be combined.
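The sketch below shows, in the simplest possible form, why inter-GPU bandwidth matters for multi-accelerator inference: a single large linear layer split column-wise across two devices, with activations copied between them on every call. It assumes two CUDA devices are present and is purely illustrative; production stacks use dedicated tensor-parallel runtimes rather than manual copies like this.

```python
import torch

d_in, d_out = 4096, 4096
# Each GPU holds half of the weight matrix's columns (assumes cuda:0 and cuda:1 exist).
w0 = torch.randn(d_in, d_out // 2, device="cuda:0")
w1 = torch.randn(d_in, d_out // 2, device="cuda:1")

def parallel_linear(x):
    # x lives on cuda:0; a copy travels to cuda:1 over the interconnect every call.
    y0 = x @ w0
    y1 = x.to("cuda:1") @ w1
    # Gather the partial results back onto one device and concatenate.
    return torch.cat([y0, y1.to("cuda:0")], dim=-1)

x = torch.randn(8, d_in, device="cuda:0")
print(parallel_linear(x).shape)   # torch.Size([8, 4096])
```

Every layer of a tensor-parallel LLM incurs transfers like the `.to(...)` copies above, which is the traffic that NVLink and NVSwitch are designed to accelerate.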

Image source: NVIDIA

At the same time, the company once again stressed the importance of its software ecosystem. In the latest MLPerf Inference round, all major NVIDIA platforms showed sharp performance gains: NVIDIA H200 accelerators, for example, delivered a 27% increase in generative AI inference performance compared with the previous round, and Triton Inference Server achieved nearly the same performance as bare-metal deployments.

Finally, thanks to software optimizations in this MLPerf round, the NVIDIA Jetson AGX Orin platform achieved more than a 6.2x throughput improvement and a 2.5x latency improvement over the previous round on the GPT-J LLM workload. According to NVIDIA, Jetson can run any transformer model locally, including LLMs, Vision Transformer-class models and, for example, Stable Diffusion. Instead of developing highly specialized models, developers can now use the general-purpose GPT-J-6B model for natural language processing at the edge.
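For reference, the public GPT-J-6B checkpoint mentioned above can be exercised with the Hugging Face Transformers library as in the sketch below. This is not NVIDIA's MLPerf harness (which relies on TensorRT-LLM and Jetson-specific optimizations); the prompt is an arbitrary example, and a GPU with roughly 12 GB of memory is assumed for the FP16 weights.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/gpt-j-6b"   # public GPT-J-6B checkpoint on the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

# Arbitrary example of a general-purpose NLP task at the edge.
prompt = "Summarize the following maintenance log entry in one sentence:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=48, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```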
