Huawei announced its own CloudMatrix 384 super accelerator at the Huawei Cloud Ecosystem Conference 2025, positioning it as a domestic alternative to NVIDIA's GB200 NVL72 system. Huawei's solution delivers higher overall performance (300 Pflops versus 180 Pflops), but it trails NVIDIA's system in per-chip performance and consumes significantly more power, writes SemiAnalysis.

Huawei CloudMatrix 384 uses 384 Huawei Ascend 910C accelerators, while the GB200 NVL72 combines 36 Grace processors with 72 B200 (Blackwell) accelerators. In other words, outdoing the GB200 NVL72 by roughly two times in performance took about five times as many Ascend 910C accelerators, which reflects poorly on per-accelerator efficiency but speaks well of the system-level engineering, SemiAnalysis noted. According to the analysts, Huawei lags NVIDIA by a generation in chip performance but is ahead in the design and deployment of scalable systems.

Image source: TechPowerUp

When comparing individual accelerators, NVIDIA's GB200 clearly outperforms Huawei's Ascend 910C, delivering more than three times the BF16 compute (2500 vs. 780 TFLOPS), more HBM per package (192 vs. 128 GB), and higher memory bandwidth (8 vs. 3.2 TB/s). In other words, NVIDIA holds the advantage in raw power at the chip level.

But at the system level, the CloudMatrix 384 comes out on top in raw capability. It delivers 1.7x more petaflops, carries 3.6x more HBM, provides 2.1x more aggregate memory bandwidth, and integrates more than five times as many accelerators as the GB200 NVL72. This scalability comes at a cost, however: the Huawei system consumes almost four times more power, roughly 560 kW versus about 145 kW. Overall, the CloudMatrix 384 requires 3.9x more power than the GB200 NVL72, which works out to 2.3x more power per petaflop, 1.8x more power per TB/s of memory bandwidth, and 1.1x more power per TB of HBM.
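The system-level ratios above follow directly from the per-chip figures quoted earlier. A minimal sketch that reproduces them (the ~560 kW and ~145 kW power figures are the estimates cited in the article, not vendor specs):

```python
# Per-chip specs as quoted in the article, plus estimated system power draw.
cm384 = dict(chips=384, tflops=780, hbm_gb=128, bw_tbs=3.2, kw=560)   # Huawei CloudMatrix 384
nvl72 = dict(chips=72, tflops=2500, hbm_gb=192, bw_tbs=8.0, kw=145)   # NVIDIA GB200 NVL72

def totals(s):
    """Scale per-chip specs up to the full system."""
    return dict(
        pflops=s["chips"] * s["tflops"] / 1000,   # system BF16 PFLOPS
        hbm_tb=s["chips"] * s["hbm_gb"] / 1000,   # total HBM, TB
        bw=s["chips"] * s["bw_tbs"],              # aggregate HBM bandwidth, TB/s
        kw=s["kw"],                               # system power, kW
    )

h, n = totals(cm384), totals(nvl72)
for key in ("pflops", "hbm_tb", "bw", "kw"):
    print(f"{key}: {h[key]:.0f} vs {n[key]:.0f} -> {h[key] / n[key]:.1f}x")

# Power needed per unit of output (higher means less efficient):
print(f"kW per PFLOPS: {(h['kw'] / h['pflops']) / (n['kw'] / n['pflops']):.1f}x")
print(f"kW per TB/s:   {(h['kw'] / h['bw']) / (n['kw'] / n['bw']):.1f}x")
print(f"kW per TB HBM: {(h['kw'] / h['hbm_tb']) / (n['kw'] / n['hbm_tb']):.1f}x")
```

Running this yields the 1.7x/3.6x/2.1x/3.9x system ratios and the 2.3x/1.8x/1.1x power-efficiency penalties cited by SemiAnalysis.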

SCMP, citing Huawei's own data, reports that the CloudMatrix 384 delivers 300 Pflops of BF16 compute without sparsity and up to 1920 tokens/s on the DeepSeek-R1 model. The super accelerator occupies 16 racks, four of which are reserved solely for the interconnect, with a total of 6912 400G ports. Each of the remaining racks houses 32 Ascend 910C accelerators in four nodes of eight (4×8) plus a ToR switch.

As SemiAnalysis noted, it would be misleading to call the Ascend 910C and CloudMatrix 384 Chinese-made: the HBM comes from Samsung, the wafers from TSMC, and the fabrication equipment from the US, the Netherlands, and Japan. Although China's SMIC already has a 7nm process, the vast majority of Ascend 910B/910C chips were quietly made on TSMC's 7nm node. Huawei allegedly circumvented US sanctions by ordering $500 million worth of chips through Sophgo; TSMC itself stopped supplying Huawei in 2020.
