Meta presented its own version of the NVIDIA GB200 NVL72 rack-scale AI system

Meta shared its hardware infrastructure innovations and explained how it sees the future of open AI platforms. In its presentation, Meta discussed a new AI platform and new rack designs, including options with increased power delivery, as well as innovations in network infrastructure.

Image source: Meta

The company currently relies on the Llama 3.1 405B model. Its context window reaches 128 thousand tokens, and it was trained on more than 15 trillion tokens in total. Training models of this scale requires very serious resources and deep optimization of the entire software and hardware stack. The base Llama 3.1 405B model was trained on a cluster of 16 thousand NVIDIA H100 accelerators, one of the first of this scale, and Meta already uses two clusters of 24 thousand accelerators each to train its AI models.
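The scale of those numbers can be sanity-checked with the widely used C ≈ 6·N·D rule of thumb for dense transformer training compute. This is a minimal sketch using the common approximation, not a figure Meta itself reported:

```python
# Back-of-envelope training-compute estimate via the common 6*N*D
# approximation for dense transformers (a rule of thumb, not Meta's
# own figure). N = parameters, D = training tokens.
params = 405e9   # Llama 3.1 405B parameters
tokens = 15e12   # over 15 trillion training tokens (per the article)

flops = 6 * params * tokens
print(f"~{flops:.2e} FLOPs")  # on the order of 3.6e25 FLOPs
```

The result, a few times 10^25 FLOPs, is why clusters of tens of thousands of H100-class accelerators are needed for a single training run.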

Projects of this scale depend on more than just accelerators: power delivery, cooling and, above all, the interconnect become the main challenges. Over the next few years, Meta expects per-accelerator bandwidth on the order of 1 TB/s. All of this will require a new, even denser architecture which, in Meta's view, should be based on open hardware standards.

One of the new products is the Catalina platform, an Open Rack v3 (Orv3) rack built around NVIDIA GB200 Grace Blackwell superchips. The rack belongs to the HPR (High Power Rack) class and is rated for 140 kW. Meta and Microsoft are jointly developing the modular, scalable Mount Diablo power system; Microsoft also has its own variant of the GB200 NVL72. In addition, Meta updated its Grand Teton AI servers, first introduced in 2022. They remain monolithic systems, but now support not only NVIDIA accelerators but also AMD Instinct MI300X and the upcoming MI325X.

The interconnect for future platforms will be the DSF (Disaggregated Scheduled Fabric) network. By moving to open standards, the company aims to avoid the scaling limits, hardware vendor lock-in and power-density constraints of proprietary fabrics. DSF is built on the OCP SAI standard and Meta's FBOSS network OS for switches, while the hardware uses a standard Ethernet/RoCE interface.

Meta has already developed and manufactured new 51.2 Tb/s-class switches based on Broadcom and Cisco silicon, as well as FBNIC network adapters created with support from Marvell. FBNIC offers up to four 100GbE ports, uses a PCIe 5.0 interface and can operate as four separate slices. The new adapter complies with the open OCP NIC 3.0 v1.2.0 standard.
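A quick check shows why PCIe 5.0 is a sensible host interface for such an adapter. This is an illustrative sketch assuming a full x16 link, which the article does not specify:

```python
# Does the host link keep up with four 100GbE ports?
# Assumption (not stated in the article): a PCIe 5.0 x16 link.
nic_gbps = 4 * 100                      # 4 x 100GbE = 400 Gb/s of network bandwidth
pcie5_x16_gbps = 32 * 16 * (128 / 130)  # 32 GT/s/lane, 16 lanes, 128b/130b encoding

print(nic_gbps, round(pcie5_x16_gbps))  # 400 vs ~504 Gb/s raw
```

Under that assumption, the raw PCIe 5.0 x16 bandwidth (~504 Gb/s before protocol overhead) comfortably exceeds the NIC's 400 Gb/s of aggregate port bandwidth.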
