Cerebras Systems, in collaboration with the US Department of Energy's (DOE) Sandia National Laboratories (SNL), successfully trained an AI model with 1 trillion parameters on a single CS-3 system equipped with a WSE-3 wafer-scale accelerator and 55 TB of MemoryX external memory.
Training models of this scale typically requires thousands of GPU-based accelerators consuming megawatts of power, dozens of experts, and weeks of hardware and software tuning, Cerebras says. However, SNL scientists were able to train the model on a single system without modifying either the model or the infrastructure software. Moreover, they achieved near-linear scaling: 16 CS-3 systems delivered a 15.3-fold increase in training speed.
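For context, the reported numbers work out to roughly 96% scaling efficiency. A minimal sketch of that arithmetic (the 16-system count and 15.3x speedup come from the report above; the efficiency formula is the standard one):

```python
# Scaling efficiency of the reported 16-system CS-3 run.
systems = 16     # number of CS-3 systems in the cluster (reported)
speedup = 15.3   # training speedup vs. a single system (reported)

efficiency = speedup / systems
print(f"Scaling efficiency: {efficiency:.1%}")  # -> Scaling efficiency: 95.6%
```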
A model of this scale requires terabytes of memory, thousands of times more than is available on a single GPU, which is why a conventional cluster of thousands of accelerators must be correctly interconnected before training can even begin. To hold the model's weights, Cerebras systems instead use external MemoryX memory built from 1U nodes with commodity DDR5, which the company says makes training a trillion-parameter model as easy as training a small model on a single accelerator.
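A back-of-the-envelope estimate illustrates why external memory is needed. The bytes-per-parameter values below are illustrative assumptions (bf16 weights and gradients plus typical Adam optimizer state), not figures from Cerebras or SNL:

```python
# Rough memory estimate for training a 1-trillion-parameter model.
# Bytes-per-parameter values are illustrative assumptions, not vendor figures.
params = 1e12  # 1 trillion parameters

bytes_per_param = {
    "weights (bf16)": 2,
    "gradients (bf16)": 2,
    "Adam state (fp32 master weights + 2 moments)": 12,
}

total_bytes = params * sum(bytes_per_param.values())
print(f"Estimated training state: ~{total_bytes / 1e12:.0f} TB")  # ~16 TB
# Far beyond any single GPU's on-board memory, but within the 55 TB of MemoryX.
```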
Previously, SNL and Cerebras deployed the Kingfisher cluster based on CS-3 systems, which will be used as a test platform for the development of AI technologies for national security.