NVIDIA reported that its platforms delivered the top results across every data center test in the MLPerf Inference v4.1 benchmark, in which the Blackwell accelerator family made its debut.

The NVIDIA B200 accelerator (SXM, 180 GB HBM) proved four times faster than the H100 on MLPerf's largest large language model (LLM) workload, Llama 2 70B, thanks to the second-generation Transformer Engine and FP4 inference on its Tensor Cores. However, it is precisely the B200 that customers may end up not getting at all.
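
To give a sense of what FP4 inference implies, below is a minimal numpy sketch that rounds weights onto the small set of magnitudes representable by a 4-bit E2M1 floating-point format. The grid values and the per-tensor scaling are illustrative assumptions; the article does not describe how the Transformer Engine quantizes internally.

```python
import numpy as np

# Magnitudes representable by a 4-bit E2M1 float (an assumption for illustration;
# the exact FP4 format used inside the Transformer Engine is not specified here).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quantize_fp4(x: np.ndarray) -> np.ndarray:
    """Scale a tensor so its largest value maps to the top of the FP4 grid,
    round every element to the nearest representable magnitude, then rescale."""
    scale = np.max(np.abs(x)) / FP4_GRID[-1]
    if scale == 0.0:
        return x.copy()
    magnitudes = np.abs(x) / scale
    nearest = np.abs(magnitudes[..., None] - FP4_GRID).argmin(axis=-1)
    return np.sign(x) * FP4_GRID[nearest] * scale

weights = np.random.default_rng(0).standard_normal((4, 4)).astype(np.float32)
print(fake_quantize_fp4(weights))
```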

The NVIDIA H200 accelerator, now available in the CoreWeave cloud as well as in systems from ASUS, Dell, HPE, QCT and Supermicro, posted the best results in every test in the data center category, including the benchmark's newest addition, the Mixtral 8x7B LLM, which has 46.7 billion parameters in total and 12.9 billion active parameters per token thanks to its Mixture of Experts (MoE) architecture.
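
A quick way to see where the 46.7 billion / 12.9 billion figures come from is to count parameters from Mixtral 8x7B's published configuration (hidden size 4096, 32 layers, FFN size 14336, 8 experts with top-2 routing, 32K vocabulary, grouped-query attention with 8 KV heads). The sketch below is only an approximation under those assumed hyperparameters:

```python
# Approximate Mixtral 8x7B parameter count, assuming its published hyperparameters.
HIDDEN, LAYERS, FFN = 4096, 32, 14336
EXPERTS, TOP_K = 8, 2
VOCAB, KV_HEADS, HEAD_DIM = 32000, 8, 128

expert_ffn = 3 * HIDDEN * FFN                                        # gate + up + down projections
attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_HEADS * HEAD_DIM   # q/o plus grouped k/v
router = HIDDEN * EXPERTS
embeddings = 2 * VOCAB * HIDDEN                                      # input embeddings + untied LM head

total = LAYERS * (EXPERTS * expert_ffn + attention + router) + embeddings
active = LAYERS * (TOP_K * expert_ffn + attention + router) + embeddings

print(f"total  ~ {total / 1e9:.1f}B parameters")   # ~ 46.7B
print(f"active ~ {active / 1e9:.1f}B per token")   # ~ 12.9B
```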


As NVIDIA noted, MoE has gained popularity as a way to make LLMs more versatile, allowing a single deployment to answer a wider range of questions and handle more diverse tasks. The architecture is also more efficient because only a few experts are activated per inference request, which means such models produce results much faster than dense models of a similar size.
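
As a rough illustration of how this routing works, here is a toy numpy sketch of a top-2 MoE layer; the sizes, the ReLU experts and the random router weights are all made up for brevity and do not reflect Mixtral's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN, FFN, NUM_EXPERTS, TOP_K = 64, 256, 8, 2   # toy sizes, not a real model's

# One small MLP per expert plus a router that scores experts for each token.
w_in = rng.standard_normal((NUM_EXPERTS, HIDDEN, FFN)) * 0.02
w_out = rng.standard_normal((NUM_EXPERTS, FFN, HIDDEN)) * 0.02
router = rng.standard_normal((HIDDEN, NUM_EXPERTS)) * 0.02

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route each token to its top-k experts and mix their outputs.
    Only k experts run per token, which is why an MoE model is cheaper at
    inference time than a dense model with the same total parameter count."""
    logits = x @ router                              # (tokens, experts)
    top = np.argsort(logits, axis=-1)[:, -TOP_K:]    # indices of the k best experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        chosen = logits[t, top[t]]
        weights = np.exp(chosen - chosen.max())
        weights /= weights.sum()                     # softmax over the chosen experts only
        for w, e in zip(weights, top[t]):
            h = np.maximum(x[t] @ w_in[e], 0.0)      # ReLU stands in for SwiGLU
            out[t] += w * (h @ w_out[e])
    return out

tokens = rng.standard_normal((4, HIDDEN))
print(moe_forward(tokens).shape)                     # (4, 64)
```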

NVIDIA also notes that as model sizes grow, combining several accelerators becomes essential to keep inference response times down. According to the company, NVLink and NVSwitch already provide significant advantages for cost-effective real-time LLM inference in the NVIDIA Hopper generation, and the Blackwell platform will expand NVLink's capabilities further, allowing up to 72 accelerators to be linked together.
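
As a simple illustration of why several accelerators have to act as one, the numpy sketch below mimics tensor parallelism under assumed toy sizes: a weight matrix is split column-wise across devices, each device multiplies its shard, and the partial results are gathered back together (over NVLink/NVSwitch in a real system).

```python
import numpy as np

rng = np.random.default_rng(1)
NUM_GPUS = 4                       # stands in for accelerators linked by NVLink/NVSwitch
HIDDEN, OUT = 64, 128              # toy layer sizes

x = rng.standard_normal((2, HIDDEN))
w = rng.standard_normal((HIDDEN, OUT))

# Column-split the weight matrix: each "GPU" holds OUT / NUM_GPUS columns.
shards = np.split(w, NUM_GPUS, axis=1)

# Each device computes its slice; concatenating the slices (an all-gather
# over the interconnect in a real system) reconstructs the full output.
partials = [x @ shard for shard in shards]
y_parallel = np.concatenate(partials, axis=1)

assert np.allclose(y_parallel, x @ w)
print(y_parallel.shape)            # (2, 128)
```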


At the same time, the company once again stressed the importance of its software ecosystem. In the latest round of MLPerf Inference, all major NVIDIA platforms demonstrated sharp performance gains: NVIDIA H200 accelerators, for example, delivered a 27% increase in generative AI inference performance compared to the previous round, and Triton Inference Server achieved nearly the same performance as bare-metal deployments.
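
For reference, querying a model hosted on Triton Inference Server from Python looks roughly like the sketch below; the model name and tensor names are hypothetical and depend entirely on how the model repository is configured.

```python
import numpy as np
import tritonclient.http as httpclient

# Hypothetical model and tensor names -- these depend on the deployed model's config.
client = httpclient.InferenceServerClient(url="localhost:8000")

text = np.array([[b"What does MLPerf Inference measure?"]], dtype=object)
inp = httpclient.InferInput("text_input", [1, 1], "BYTES")
inp.set_data_from_numpy(text)

result = client.infer(model_name="llm_model", inputs=[inp])
print(result.as_numpy("text_output"))
```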

Finally, thanks to software optimizations in this MLPerf round, the NVIDIA Jetson AGX Orin platform achieved more than a 6.2x throughput improvement and a 2.5x latency improvement over the previous round on the GPT-J LLM workload. According to NVIDIA, Jetson can process any transformer model locally, including LLMs, Vision Transformer class models and, for example, Stable Diffusion. And instead of developing highly specialized models, the universal GPT-J-6B model can now be used for natural language processing at the edge.
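
A minimal sketch of running GPT-J-6B locally with the Hugging Face transformers library is shown below; this is a generic example under assumed defaults, not NVIDIA's optimized Jetson deployment path (which would typically go through TensorRT-LLM or a similar runtime).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/gpt-j-6B"   # the universal GPT-J-6B model mentioned above
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

prompt = "Summarize in one sentence: MLPerf Inference measures how fast systems run AI models."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```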
