Not so simple and not so fast: researchers examine how memory and NVLink-C2C behave in NVIDIA Grace Hopper

The NVIDIA Grace Hopper hybrid accelerator combines CPU and GPU modules connected by the NVLink-C2C interconnect. But, as HPCwire reports, the design and behavior of the superchip have some nuances, which Swedish researchers have described.

They measured the performance of the Grace Hopper memory subsystems and the NVLink-C2C interconnect in realistic scenarios and compared the results with the specifications declared by NVIDIA. As a reminder, NVIDIA originally stated a speed of 900 GB/s for the interconnect, which is seven times the capability of PCIe 5.0. The HBM3 memory attached to the GPU has a bandwidth of up to 4 TB/s, and the version with HBM3e offers up to 4.9 TB/s. The CPU part (Grace) uses LPDDR5X with a memory bandwidth of up to 512 GB/s.
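The "seven times" figure can be checked with simple arithmetic. The sketch below assumes the comparison is against a PCIe 5.0 x16 link at roughly 128 GB/s in both directions combined (about 64 GB/s each way). That figure is a commonly quoted value and does not come from the article itself.

```python
# Declared NVLink-C2C bandwidth from the article (GB/s, both directions combined).
NVLINK_C2C_GBPS = 900.0

# Assumption (not from the article): PCIe 5.0 x16 at about 64 GB/s per
# direction, i.e. about 128 GB/s bidirectional.
PCIE5_X16_GBPS = 128.0


def speedup(fast_gbps: float, slow_gbps: float) -> float:
    """Return how many times faster the first link is than the second."""
    return fast_gbps / slow_gbps


ratio = speedup(NVLINK_C2C_GBPS, PCIE5_X16_GBPS)
print(f"NVLink-C2C vs PCIe 5.0 x16: {ratio:.1f}x")
```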

The researchers had the base version of Grace Hopper, with 480 GB of LPDDR5X and 96 GB of HBM3. The system ran Red Hat Enterprise Linux 9.3 with CUDA 12.4. In the STREAM benchmark they measured a bandwidth of 486 GB/s for the CPU and 3.4 TB/s for the GPU, both close to the stated figures. NVLink-C2C, however, reached only 375 GB/s in the host-to-device direction and 297 GB/s in the reverse direction. The combined throughput is 672 GB/s, about 75% of the stated theoretical maximum of 900 GB/s.
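For reference, the share of the declared bandwidth that each subsystem reached can be computed directly from the numbers above. This is a minimal sketch; the dictionary keys are illustrative names, and the GPU figure is compared against the 4 TB/s stated for HBM3.

```python
# Declared and measured bandwidths in GB/s, taken from the article.
DECLARED = {"cpu_lpddr5x": 512.0, "gpu_hbm3": 4000.0, "nvlink_c2c": 900.0}
MEASURED = {
    "cpu_lpddr5x": 486.0,
    "gpu_hbm3": 3400.0,
    # Host-to-device plus device-to-host.
    "nvlink_c2c": 375.0 + 297.0,
}


def efficiency(measured_gbps: float, declared_gbps: float) -> float:
    """Fraction of the declared bandwidth that was actually measured."""
    return measured_gbps / declared_gbps


for name, declared in DECLARED.items():
    share = efficiency(MEASURED[name], declared)
    print(f"{name}: {MEASURED[name]:.0f} of {declared:.0f} GB/s ({share:.1%})")
```

The memory subsystems land at roughly 95% and 85% of their declared values, while the interconnect stays just under 75%.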

Image source: NVIDIA

By design, Grace Hopper provides two kinds of page tables: a system-wide one (4 KB or 64 KB pages by default), which covers both the CPU and the GPU, and a GPU-exclusive one (2 MB pages). Initialization speed depends on which side the request comes from. If memory is initialized on the CPU side, the data is placed in LPDDR5X by default, the GPU accesses it directly over NVLink-C2C (without migration), and the page table is visible to both the GPU and the CPU.
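To give a sense of scale for these page sizes, the sketch below counts how many pages, and therefore how many mappings, are needed to cover a buffer. The 1 GiB buffer is a hypothetical example, not a configuration from the study.

```python
# Page sizes mentioned in the article, in bytes.
PAGE_SIZES = {
    "4 KB (system)": 4 * 1024,
    "64 KB (system)": 64 * 1024,
    "2 MB (GPU-exclusive)": 2 * 1024 * 1024,
}


def pages_needed(buffer_bytes: int, page_bytes: int) -> int:
    """Number of pages (one mapping each) needed to cover the buffer."""
    return -(-buffer_bytes // page_bytes)  # ceiling division


BUFFER_BYTES = 1 << 30  # hypothetical 1 GiB allocation

for label, size in PAGE_SIZES.items():
    print(f"{label:>22}: {pages_needed(BUFFER_BYTES, size):>7} pages")
```

For 1 GiB this gives 262,144 pages at 4 KB, 16,384 at 64 KB, and 512 at 2 MB.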

Image source: arxiv.org

If memory is managed by CUDA rather than by the OS, initialization can take place directly on the GPU side, which is usually much faster, and the data can be placed in HBM. In this case there is a single virtual address space but two page tables, one for the CPU and one for the GPU, and data is exchanged between them through page migration. Still, even with NVLink-C2C, the ideal situation remains one where HBM is sufficient for GPU workloads and LPDDR5X is sufficient for CPU workloads.

Image source: arxiv.org

The researchers also examined performance with memory pages of different sizes. 4 KB pages are typically used by the CPU part with LPDDR5X, and also when the GPU needs to fetch data from the CPU over NVLink-C2C. As a rule, though, 64 KB pages are the better choice for HPC workloads because they require fewer resources to manage. When memory access is irregular and scattered, 4 KB pages allow finer-grained control of resources. In some cases they give a 2x performance benefit because unused data is not moved along with 64 KB pages.
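The effect of page size on scattered access can be illustrated with a toy model that assumes every touched page is migrated whole. The access pattern is hypothetical, and the model only counts bytes moved; it is not the researchers' methodology and does not attempt to reproduce the 2x figure.

```python
KIB = 1024


def bytes_migrated(offsets, page_bytes: int) -> int:
    """Bytes moved if every page touched by an access is migrated whole."""
    touched_pages = {offset // page_bytes for offset in offsets}
    return len(touched_pages) * page_bytes


# Hypothetical scattered pattern: one access every 64 KiB across a
# 1 MiB buffer (16 accesses in total).
offsets = range(0, 1024 * KIB, 64 * KIB)

small = bytes_migrated(offsets, 4 * KIB)   # 16 pages of 4 KiB each
large = bytes_migrated(offsets, 64 * KIB)  # 16 pages of 64 KiB each

print(f"4 KB pages:  {small // KIB} KiB moved")
print(f"64 KB pages: {large // KIB} KiB moved")
print(f"ratio: {large // small}x")
```

With this pattern the larger pages move 16 times more data for the same 16 accesses. When every byte of the buffer is touched, both page sizes move the same amount, and the lower management overhead of 64 KB pages becomes the deciding factor.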

The published paper notes that further research is needed to gain a deeper understanding of how unified memory works in heterogeneous systems such as Grace Hopper.
