Developer Tim Zaman, who worked at Twitter during the sale of the social network to Elon Musk and has since moved to Google DeepMind, has described an unusual find, Tom’s Hardware reports. According to him, a few weeks after the deal closed, engineers discovered a cluster of 700 idle NVIDIA V100 accelerators. Zaman himself described the discovery as “an honest attempt to build a cluster within the framework of Twitter 1.0.” He was reminded of the episode by the news of xAI’s AI supercomputer, which consists of 100,000 NVIDIA H100 accelerators.

What makes the find sad is that for years Twitter had 700 high-performance accelerators based on NVIDIA’s Volta architecture at its disposal that were powered on but sitting idle. The V100 was in short supply after its 2017 release, yet Zaman only discovered the dormant cluster in 2022. It is not surprising that around the same time the company decided to close some of its data centers. Notably, the cluster used PCIe cards rather than the SXM2 version of the V100 with NVLink, which is much more efficient for AI workloads.

Image source: Alexander Shatov/unsplash.com

Zaman also shared his thoughts on the “AI Gigafactory.” He suggested that running 100,000 accelerators within a single network fabric must be an epic challenge: at that scale failures are inevitable, and they have to be managed properly to keep the entire system functional. In his opinion, the system should be divided into independent domains, which is how large clusters are typically designed. Zaman also wondered what the maximum number of accelerators within a single cluster could be: as companies build ever larger AI training systems, they will run into both predictable and unexpected limits on how many accelerators can be combined.
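A rough calculation illustrates why failure management dominates at this scale. The Python sketch below assumes a per-accelerator mean time between failures (MTBF) of five years and a domain size of 1,000 GPUs; both numbers are illustrative assumptions, not figures from Zaman or xAI.

```python
# Back-of-the-envelope: why failures are inevitable at 100,000 accelerators.
# The per-GPU MTBF and domain size below are illustrative assumptions.

GPU_COUNT = 100_000
DOMAIN_SIZE = 1_000                 # hypothetical independent failure domain
MTBF_PER_GPU_HOURS = 5 * 365 * 24   # assume ~5 years MTBF per accelerator

# With independent failures, the cluster-wide failure rate scales linearly
# with the number of accelerators:
cluster_failures_per_hour = GPU_COUNT / MTBF_PER_GPU_HOURS
hours_between_failures = 1 / cluster_failures_per_hour

print(f"Expected failures per hour across the cluster: {cluster_failures_per_hour:.2f}")
print(f"Mean time between failures anywhere in the cluster: {hours_between_failures:.2f} h")

# If the fabric is split into independent domains, a single failure stalls
# only one domain (1% of capacity here) instead of the whole job:
blast_radius = DOMAIN_SIZE / GPU_COUNT
print(f"Capacity affected per failure with domains: {blast_radius:.1%}")
```

Even under this optimistic MTBF assumption, something in a 100,000-GPU cluster fails roughly every half hour, which is why partitioning the fabric into independent domains, so that one failure takes out only a small fraction of capacity rather than the entire run, becomes essential.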
