An unexpected problem has hit Nvidia’s latest GB200 NVL72 and NVL36 server systems, which are built around the GB200 compute accelerators designed for artificial intelligence workloads. Shortly before mass production and the product launch, a serious flaw was discovered in the liquid cooling system.

Image source: NVIDIA

As a reminder, a GB200 NVL72 system is an entire server rack containing 18 1U nodes, each with a pair of GB200 accelerators. Each GB200, in turn, combines two Nvidia B200 chips with one 72-core Arm-based Grace processor. In total, the system includes 72 B200 chips and 36 Grace processors, connected by the NVLink 5 bus. The whole rack consumes about 120 kW and has its own rack-level support systems, including liquid cooling, as well as a single DC power bus. The GB200 NVL36, in turn, is a system with half as many GB200 accelerators. According to preliminary data, the GB200 NVL72 system will cost $3 million.
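The component counts above follow from simple multiplication, and a short sketch makes the arithmetic easy to check. The figures are the ones quoted in this article; the function and variable names are purely illustrative, and the power-per-accelerator number is only a rough rack-level average (it includes everything else in the rack), not a chip specification.

```python
# Figures taken from the article; all names here are illustrative.
NODES_PER_NVL72 = 18     # 1U compute nodes in a GB200 NVL72 rack
GB200_PER_NODE = 2       # GB200 accelerators per node
B200_PER_GB200 = 2       # B200 chips in each GB200
GRACE_PER_GB200 = 1      # Grace processors in each GB200
NVL72_POWER_KW = 120     # approximate power draw of the whole rack


def totals(gb200_count):
    """Return the chip counts for a system with the given number of GB200s."""
    return {
        "gb200": gb200_count,
        "b200": gb200_count * B200_PER_GB200,
        "grace": gb200_count * GRACE_PER_GB200,
    }


nvl72 = totals(NODES_PER_NVL72 * GB200_PER_NODE)
# The article describes the NVL36 only as having half as many GB200s.
nvl36 = totals(nvl72["gb200"] // 2)

# Rough average of rack power per GB200, including all other rack hardware.
avg_kw_per_gb200 = round(NVL72_POWER_KW / nvl72["gb200"], 2)

print(nvl72)  # {'gb200': 36, 'b200': 72, 'grace': 36}
print(nvl36)  # {'gb200': 18, 'b200': 36, 'grace': 18}
print(avg_kw_per_gb200)
```

The B200 counts (72 and 36) match the numbers in the product names.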

As TweakTown reports, citing the Taiwanese publication UDN, leaks have been detected in the GB200 NVL72 liquid cooling systems. According to preliminary data, they are linked to components from third-party manufacturers. Nvidia had previously outsourced the production of some cooling system components, such as pipes, quick connectors and hoses, to its partners, which are large international manufacturers.

Image Source: TheRegister.com

The leaks were discovered before mass production of the NVL36 and NVL72 AI systems began, giving manufacturers time to fix the problems. Despite the difficulties and the risk of missing delivery dates for key customers, the product is expected to ship on time.

However, the incident has raised concerns among major cloud service providers, who are worried about the reliability of Nvidia’s new servers. In response, Taiwanese manufacturers such as Shuanghong and Qihong have begun to ramp up production of liquid cooling components to give Nvidia alternative suppliers.

Certifying pipes, quick-release couplings and hoses is a complex process that requires specialized knowledge and experience. Taiwanese companies did not previously specialize in such components, but Nvidia’s decision to use liquid cooling for its AI systems pushed them to develop new technologies. Work to eliminate the problem is currently underway, and server racks with GB200 accelerators and the corrected cooling system are expected to begin shipping to customers in the near future.
