Meta unveils its own version of the NVIDIA GB200 NVL72 rack-scale accelerator

Meta shared its innovations in hardware infrastructure and explained how it sees the future of open AI platforms. In its presentation, the company covered a new AI platform, new rack designs, including higher-power options, and advances in network infrastructure.

Image source: Meta

The company currently uses the Llama 3.1 405B neural network. Its context window reaches 128 thousand tokens, and it was trained on more than 15 trillion tokens. Training models of this scale requires substantial resources and deep optimization of the entire hardware and software stack. The base Llama 3.1 405B model was trained on a cluster of 16 thousand NVIDIA H100 accelerators, one of the first of that size; Meta now trains AI models on two clusters of 24 thousand accelerators each.
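The scale of that training run can be sketched with the standard back-of-the-envelope formula of roughly 6 FLOPs per parameter per token. The parameter, token and cluster figures below come from the article; the H100 throughput and utilization values are illustrative assumptions, not numbers Meta has published.

```python
# Rough estimate of Llama 3.1 405B training compute,
# using the common ~6 * parameters * tokens approximation.

params = 405e9   # model parameters (from the article)
tokens = 15e12   # training tokens, "over 15 trillion" (from the article)
gpus = 16_000    # H100 accelerators in the training cluster (from the article)

total_flops = 6 * params * tokens  # ~3.6e25 FLOPs

# Assumed sustained throughput per GPU: H100 SXM dense BF16 peak
# (~989 TFLOPS) scaled by an assumed 40% model FLOPs utilization.
peak_flops_per_gpu = 989e12
utilization = 0.4

seconds = total_flops / (gpus * peak_flops_per_gpu * utilization)
days = seconds / 86_400

print(f"total compute: {total_flops:.2e} FLOPs")
print(f"estimated wall-clock time: {days:.0f} days")
```

Even under these optimistic assumptions the run takes on the order of two months of wall-clock time on 16 thousand accelerators, which illustrates why the whole software and hardware stack has to be tuned end to end.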

Projects of this scale depend on more than accelerators alone: power delivery, cooling and, above all, the interconnect come to the fore. Over the next few years, Meta expects per-accelerator bandwidth on the order of 1 TB/s. All this calls for a new, even denser architecture, which, according to Meta, should be built on open hardware standards.

One of the new products is the Catalina platform: an Orv3 rack built around NVIDIA GB200 superchips, which combine Grace CPUs with Blackwell GPUs. The rack belongs to the HPR (High Power Rack) class and is rated for 140 kW. Meta and Microsoft are jointly developing the modular, scalable Mount Diablo power system, and Microsoft also has its own variant of the GB200 NVL72. Meta has additionally updated the Grand Teton AI servers first introduced in 2022: they remain monolithic systems, but now support not only NVIDIA accelerators but also AMD Instinct MI300X and the upcoming MI325X.
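The 140 kW rating makes clear why a new rack and power architecture is needed. A minimal arithmetic sketch, assuming an NVL72-style layout of 72 GPU positions per rack (the per-position overhead split is not something Meta has published):

```python
# Per-GPU-position power math for a 140 kW Orv3 High Power Rack.
# The 140 kW figure is from the article; 72 GPU positions matches
# an NVL72-style layout and is an assumption for illustration.

rack_power_w = 140_000
gpu_positions = 72

watts_per_gpu_position = rack_power_w / gpu_positions
print(f"~{watts_per_gpu_position:.0f} W per GPU position")
```

Nearly 2 kW per GPU position, and 140 kW in a single rack, is far beyond what conventional air-cooled racks handle, hence the move to liquid cooling and dedicated power systems such as Mount Diablo.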

The interconnect for future platforms will be the DSF (Disaggregated Scheduled Fabric) network. By moving to open standards, the company plans to avoid the scaling limits, hardware vendor lock-in and power-density constraints of proprietary fabrics. DSF is based on the OCP-SAI standard and Meta's FBOSS switch operating system, with hardware built on a standard Ethernet/RoCE interface.

Meta has already developed and manufactured new 51T-class switches based on Broadcom and Cisco silicon, as well as the FBNIC network adapter created with support from Marvell. FBNIC offers up to four 100GbE ports, uses a PCIe 5.0 interface that can operate as four separate slices, and complies with the open OCP NIC 3.0 v1.2.0 standard.
