Meta shared its latest hardware infrastructure developments and explained how it sees the future of open AI platforms. In its presentation, the company described a new AI platform, new rack designs (including high-power options), and advances in network infrastructure.
The company's current flagship model is Llama 3.1 405B. This LLM has a context window of 128 thousand tokens and was trained on more than 15 trillion tokens. Training models of this size takes very substantial resources and deep optimization of the entire software and hardware stack. The base Llama 3.1 405B model was trained on a cluster of 16 thousand NVIDIA H100 accelerators, one of the first of that scale. Meta already uses two clusters of 24 thousand accelerators each to train AI models.
Projects of this scale depend on more than accelerators alone. Power delivery, cooling and, above all, the interconnect come to the fore. Over the next few years, Meta expects interconnect bandwidth on the order of 1 TB/s per accelerator. All of this will require a new, even denser architecture, which, according to Meta, should be based on open hardware standards.
One of the new products is the Catalina platform. It is an Orv3 rack built around NVIDIA GB200 superchips. The rack belongs to the HPR (High Power Rack) class and is designed for 140 kW. Microsoft and Meta are currently working on Mount Diablo, a modular and scalable power system. Microsoft also has its own version of the GB200 NVL72. Meta has also updated its Grand Teton AI servers, first introduced in 2022. They are still monolithic systems, but they now support AMD Instinct MI300X and the upcoming MI325X in addition to NVIDIA accelerators.
The interconnect for future platforms will be the DSF (Disaggregated Scheduled Fabric) network. By moving to open standards, the company plans to avoid limitations related to scaling, dependence on hardware vendors, and power density. DSF is based on the OCP-SAI standard and Meta's FBOSS operating system for switches. At the hardware level it uses a standard Ethernet/RoCE interface.
Meta has already developed and manufactured new 51T-class switches based on Broadcom and Cisco silicon, as well as FBNIC network adapters created with support from Marvell. FBNIC can have up to four 100GbE ports. It uses a PCIe 5.0 interface and can operate as four separate slices. The adapter complies with the open OCP NIC 3.0 v1.2.0 standard.
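As a rough sanity check on these figures, the short Python sketch below works out the aggregate Ethernet bandwidth of a fully populated FBNIC and the number of 100GbE ports a 51T-class switch could serve. It assumes that "51T" means 51.2 Tb/s of switching capacity, which is the usual reading for this class but is not stated above. The function names are hypothetical, and protocol and encoding overheads are ignored.

```python
# Back-of-the-envelope bandwidth arithmetic for the NIC and switch figures.
# Assumption (not from the article): a "51T" switch provides 51.2 Tb/s,
# i.e. 51,200 Gb/s. All function names here are illustrative only.

PORT_SPEED_GBPS = 100          # one 100GbE port
SWITCH_CAPACITY_GBPS = 51_200  # assumed capacity of a 51T-class switch


def nic_bandwidth_gbps(ports: int = 4, port_speed_gbps: int = PORT_SPEED_GBPS) -> int:
    """Aggregate line rate of a NIC with the given number of ports, in Gb/s."""
    return ports * port_speed_gbps


def switch_port_count(capacity_gbps: int = SWITCH_CAPACITY_GBPS,
                      port_speed_gbps: int = PORT_SPEED_GBPS) -> int:
    """How many ports of the given speed fit into the switch capacity."""
    return capacity_gbps // port_speed_gbps


def gbps_to_gigabytes_per_s(gbps: float) -> float:
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbps / 8


if __name__ == "__main__":
    nic = nic_bandwidth_gbps()
    print(f"FBNIC, 4 x 100GbE: {nic} Gb/s ({gbps_to_gigabytes_per_s(nic)} GB/s)")
    print(f"100GbE ports per 51.2 Tb/s switch: {switch_port_count()}")
    print(f"Four-port NICs per switch: {switch_port_count() // 4}")
```

Under these assumptions a four-port FBNIC tops out at 400 Gb/s, or 50 GB/s, and one switch could terminate 512 ports of 100GbE.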