AMD demonstrated the advantages of a chiplet layout back in the first generation EPYC (Naples), but in subsequent generations the homogeneous layout was replaced by a heterogeneous one, with a separate chiplet responsible for all needs related to I/O operations.

This was the case in Rome, Milan, Genoa, and Bergamo, and now it’s time to see what has changed in the recently announced EPYC 9005 (Turin), and whether it will be enough to once again secure the title of leader in multi-core server solutions.

Source: AMD.

First of all, this is, of course, the arrival of the fifth-generation Zen compute architecture in the EPYC series, which debuted some time ago in consumer-class AMD Ryzen processors. As is known, Zen 5 brought a solid efficiency gain: IPC (the number of instructions executed per clock) increased by about 17%. Behind this lie fairly serious microarchitectural changes, which are nevertheless evolutionary in nature.

The Zen 5 core received a new branch prediction unit and a unified scheduler, while the instruction fetch and decode units were split into two clusters to optimize SMT (an interesting contrast with Intel, which is moving toward abandoning SMT altogether).

The first-level cache has become faster and larger, the address translation tables have grown, and the execution side has been expanded with additional supported instructions. In particular, Zen 5 implements full AVX-512 support with a native 512-bit datapath.
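As a rough illustration of what a native 512-bit datapath means in practice, the number of SIMD lanes per register for common data types follows directly from the register width:

```python
# SIMD lanes that fit into one 512-bit AVX-512 register, per data type.
REGISTER_BITS = 512

for name, bits in [("fp64", 64), ("fp32", 32), ("int8", 8)]:
    lanes = REGISTER_BITS // bits
    print(f"{name}: {lanes} lanes per register")
# fp64: 8 lanes, fp32: 16 lanes, int8: 64 lanes
```

With a genuine 512-bit execution path, each such operation completes in one pass rather than being split into two 256-bit halves, as in earlier "double-pumped" implementations.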

But let’s return to the new EPYC 9005. The leadership of Intel’s Xeon 6 (Granite Rapids and Sierra Forest) in core count did not last very long: AMD regained the lead, guided by a simple formula: a 50% increase in core count + a 25% higher thermal package + the transition to the Zen 5/5c architecture, all while maintaining compatibility with the existing hardware ecosystem.

Like the previous generation EPYC (Genoa and Bergamo), the new Turin processors use the SP5 socket (LGA-6096), designed for a 12-channel memory subsystem and 128 PCI Express 5.0 lanes. In the case of a dual-processor motherboard layout, some of the latter are used for interprocessor communication.

Interestingly, this time there is no separate name for the high-density variant of the processor: EPYC 9005 versions do have different CCD chiplet layouts with Zen 5 and Zen 5c cores, as well as different model identifiers (00h–0Fh and 10h–1Fh, respectively), but they share the same code name, although the name Turin Dense is also in use.

Previously published information about 16 eight-core chiplets for the classic version and 12 sixteen-core chiplets for the high-density Turin version has been confirmed. The chiplets are indeed grouped into four and three blocks (quadrants), respectively.

The chiplets themselves have moved to TSMC’s 4 nm and 3 nm process nodes, which made another increase in core count possible. In this respect the high-density version of Turin even broke the 128-core barrier, a first for an x86 processor.

The internal CCX structure differs significantly between the two: in effect, a CCX is defined by a single L3 cache slice, and that slice is the same 32 MB for both the Zen 5 and Zen 5c variants. In other words, each Zen 5c core potentially gets less cache (2 MB versus 4 MB per core), but that is the price of a denser layout.
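The figures above can be cross-checked with a little arithmetic, assuming (as the text implies) one CCX with one 32 MB L3 slice per CCD:

```python
# Back-of-the-envelope layout math for the two Turin variants,
# using the figures quoted in the text (8- and 16-core CCDs, 32 MB L3 per CCX).
configs = {
    "Turin (Zen 5)":        {"ccds": 16, "cores_per_ccd": 8,  "l3_per_ccx_mb": 32},
    "Turin Dense (Zen 5c)": {"ccds": 12, "cores_per_ccd": 16, "l3_per_ccx_mb": 32},
}

for name, c in configs.items():
    total_cores = c["ccds"] * c["cores_per_ccd"]
    l3_per_core = c["l3_per_ccx_mb"] / c["cores_per_ccd"]  # one CCX per CCD assumed
    print(f"{name}: {total_cores} cores, {l3_per_core:.0f} MB L3 per core")
# Turin: 128 cores, 4 MB per core; Turin Dense: 192 cores, 2 MB per core
```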

There is still a single IOD (I/O Die), although it has been seriously redesigned to support a larger number of attached CCDs. The connection uses GMI3 links running at 1.8 GHz, twice as fast as the GMI2 links used in the EPYC 7003 series.

There are two operating modes: GMI3-Narrow for configurations with 12 and 16 chiplets, and GMI3-Wide for processors with only 8 CCDs active (doubling the throughput from CCD to IOD). Four xGMI links can be used to exchange data with a second processor in a 2S configuration. The new IOD also provides the flexibility to bifurcate SerDes lanes and assign specific functions to them.
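A minimal sketch of how such a link-allocation scheme could work; the total of 16 GMI3 ports on the IOD is an assumption chosen purely for illustration, so that exactly the 8-CCD configurations qualify for Wide mode:

```python
# Hypothetical GMI3 link-allocation model: when there are few enough CCDs,
# spare IOD ports can be paired up ("Wide" mode), doubling per-CCD bandwidth.
IOD_GMI_PORTS = 16  # assumed port count, for illustration only

def links_per_ccd(ccds: int) -> int:
    """Two links per CCD if the port budget allows, otherwise one."""
    return 2 if ccds * 2 <= IOD_GMI_PORTS else 1

for ccds in (8, 12, 16):
    n = links_per_ccd(ccds)
    mode = "Wide" if n == 2 else "Narrow"
    print(f"{ccds} CCDs: GMI3-{mode}, {n} link(s) per CCD")
# 8 CCDs: Wide (2 links); 12 and 16 CCDs: Narrow (1 link)
```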

The IOD gives Turin support for 128 PCI Express 5.0 lanes in single-processor mode and up to 160 in dual-processor mode. Four x16 links can be used as 64 CXL 2.0 lanes (Type 1, 2, and 3 devices), and up to 32 I/O lanes can be configured as SATA interfaces. The latter interface is rapidly losing relevance today, and AMD implemented its support mainly for intra-platform compatibility.

AMD also tried to maximize the efficiency of the I/O subsystems as such, understanding the importance of the channels connecting the CPU with all kinds of accelerators in the era of AI and LLMs. EPYC 9005 fully supports DMA and P2P connections, and security has not been forgotten: PCIe traffic encryption is implemented within the framework of SEV-SNP.

Also of interest is SDCI technology, which allows I/O devices to write data directly to the cache hierarchy, bypassing DRAM, which reduces the load on the memory subsystem and potentially increases the efficiency of data exchange between the processor and accelerators.

With the Xeon 6 processors, Intel pulled seriously ahead of AMD in memory subsystem parameters: even Sierra Forest received DDR5-6400 support (5200 at 2DPC), while Granite Rapids even supports the newfangled MRDIMM DDR5-8800, and that across 12 channels. AMD EPYC processors were limited to DDR5-4800 at best, albeit with the same number of channels.

But the advent of Turin restores practical parity: there are still 12 memory channels, but the platform now supports DDR5-6000, and for some custom platforms AMD intends to allow DDR5-6400. In its current form, EPYC 9005 does not support exotic options like MCRDIMM/MRDIMM; instead, the company plans to add support in future EPYCs once the new memory standard is ratified by JEDEC.

However, even leaving aside the higher frequencies, which by themselves could provide a 20–25% throughput increase, there are plenty of innovations: the new memory controllers are significantly more efficient than the old ones, support x80 and x72 error-correcting modules, can retry reads on uncorrectable ECC (UECC) errors, and support 3DS RDIMMs for a total capacity of 6 TB per processor.

Peak throughput can reach 576 GB/s, which is higher than NVIDIA Grace (72 cores, 500 GB/s), but lower than that of the dual Grace Superchip (144 cores, 1 TB/s). Latency has not increased at all and is approximately the same 110 ns as that of the previous generation EPYC memory controllers working with DDR5-4800 modules.
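The 576 GB/s figure is simply the product of the channel count, the transfer rate, and the per-channel bus width:

```python
# Peak DRAM bandwidth estimate: channels x transfer rate x bus width.
CHANNELS = 12
TRANSFERS_PER_S = 6000e6   # DDR5-6000: 6000 mega-transfers per second
BYTES_PER_TRANSFER = 8     # 64-bit data bus per channel

peak_gb_s = CHANNELS * TRANSFERS_PER_S * BYTES_PER_TRANSFER / 1e9
print(f"{peak_gb_s:.0f} GB/s")  # 576 GB/s, matching the figure above
```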

As mentioned, the EPYC 9005 fully implements CXL 2.0 support for all three existing device types, but focuses on Type 3 devices as RAM expanders. There is support for hierarchy levels, the ability to combine CXL devices into a common NUMA domain, QoS functions with bandwidth sharing between DRAM and CXL memory, and so on. AMD tried to ensure the highest possible CXL performance with minimal latency, but only testing will show how well these innovations work compared with, for example, Xeon Granite Rapids.

When it comes to NUMA, we cannot fail to mention that EPYC 9005 can operate in different modes depending on the NUMA Nodes Per Socket (NPS) value set in the BIOS. NPS0 on a two-socket system means a monolithic configuration with one NUMA domain for the entire system: memory is interleaved into a single address space, and both processors have equal access to all memory and all physically attached PCIe/CXL devices.

A value of 1 gives two domains, 2 divides each processor into two domains, and 4 represents each “quadrant” of the processor as a separate NUMA domain, including 4 CCDs on Zen 5 and 3 CCDs on Zen 5c. The choice of setting depends on the usage scenario and the specific software being used.
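Under the assumptions above (a classic 16-CCD part, two sockets), the NPS setting maps to NUMA domains roughly as follows; this is a hypothetical helper for illustration, not an official AMD tool:

```python
# Sketch of how the NPS BIOS setting maps to NUMA domains on a 2-socket
# Turin system. Domain counts follow the description in the text; per-domain
# CCD counts assume the classic (16-CCD, Zen 5) variant.
def nps_domains(nps: int, sockets: int = 2, ccds_per_socket: int = 16):
    """Return (total NUMA domains, CCDs per domain) for an NPS value."""
    if nps == 0:  # whole 2S system interleaved into one domain
        return 1, sockets * ccds_per_socket
    return sockets * nps, ccds_per_socket // nps

for nps in (0, 1, 2, 4):
    domains, ccds = nps_domains(nps)
    print(f"NPS{nps}: {domains} domain(s), {ccds} CCDs each")
# NPS0: 1x32, NPS1: 2x16, NPS2: 4x8, NPS4: 8x4
```

For the Zen 5c parts, passing `ccds_per_socket=12` gives the 3-CCDs-per-quadrant split mentioned above.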

Also worth noting is the expanded set of reliability, availability, and serviceability (RAS) tools. To those already implemented in the previous generation of EPYC, out-of-band error management over a dedicated channel and automatic replacement of faulty DRAM cells with spare working ones have been added. The list of supported RAS features is wide.

Unlike Intel, AMD presented 27 EPYC 9005 models at once, with core counts from 8 to 192. It should be noted right away that the new process nodes and improved architecture had a very positive impact on the EPYC frequency formula: where previously turbo frequencies rarely exceeded 4 GHz, with the EPYC 9005 this is par for the course.

The only exceptions are models with Zen 5c cores, but even they reach 3.7 GHz, accompanied by corresponding thermal packages of 320–500 W. Note that the latter figure requires an update to the platform’s power delivery subsystems, since previously the maximum TDP did not exceed 400 W. The new EPYC series still includes models designed for single-processor systems; they carry the suffix “P” in their names.

The suffix “F” marks models with an extended frequency formula, in which the lower limit is at least 3.1 GHz, and the upper limit approaches 5 GHz. These processors also have the maximum amount of L3 cache. Together with the frequencies, this makes the EPYC 9005F the optimal choice for scenarios with per-core software licensing.
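The per-core licensing argument is easy to quantify; the price below is a purely hypothetical placeholder, used only to show the shape of the calculation:

```python
# Hypothetical per-core license math: fewer, faster cores can meet the same
# performance target with a much smaller license bill.
LICENSE_PER_CORE_USD = 1000.0  # assumed annual cost, illustrative only

def annual_license_cost(cores: int) -> float:
    return cores * LICENSE_PER_CORE_USD

# e.g. a 64-core 9575F versus a 192-core 9965 (model numbers from the text)
print(annual_license_cost(64))   # 64000.0
print(annual_license_cost(192))  # 192000.0
```

If the high-frequency 64-core part delivers enough per-core throughput for the workload, the licensing savings alone can dwarf any difference in hardware price.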

The first test results for the new AMD server processors have already been published online: reviewers at Phoronix, for example, managed to test three new products at once – the EPYC 9755 (128 Zen 5 cores, 4.1 GHz turbo), 9575F (64 Zen 5 cores, 5 GHz turbo), and 9965 (192 Zen 5c cores, 3.7 GHz turbo). The results are impressive: the combination of Zen 5 in server form with increased clock speeds did the trick, and AMD’s new products firmly occupied the top spots in almost all categories.

Source: Phoronix

In places, for example in the OpenSSL test, Turin Dense (EPYC 9965) performed excellently: it is ahead of the Intel Xeon 6700E not only in core count (192 versus 144), but also because its cores, despite the density optimization, are a full-fledged implementation of Zen 5 rather than a cut-down version of the “big” architecture, as is the case with Intel’s E-cores. As a result, the top three places belong to AMD solutions, and only fourth place went to a system with a Xeon 6980P equipped with high-speed MRDIMM-8800 modules. Equipped with ordinary DDR5-6400, that system could hold its own only against a single EPYC 9755. For a dual-processor EPYC 9755 system, the gap over a comparable Granite Rapids platform averaged 40%.

As for the high-density EPYC 9965, it is 45% ahead of a dual-processor setup based on the flagship EPYC 9754 (Bergamo), despite having fewer cores (192 versus 256): the new architecture and the significant increase in clock frequencies made themselves felt. In terms of power consumption, the new AMD chips are, of course, not as frugal as the Intel Xeon 6700E (Sierra Forest), but they are not far behind, and the EPYC 9755, despite its monstrous 500 W thermal package, still turned out to be more economical than the Xeon 6980P: its power ceiling really was 500 W, while Intel’s flagship consumed almost 550 W at peak.

At the same time, AMD now beats Intel even where the “blue” team was always invincible: AMD also has full AVX-512 support, as well as a 192-core answer to the 144-core Xeon 6700E. The capital outlay for migrating infrastructure from fourth- to fifth-generation EPYC can be relatively small: in most cases the task comes down to flashing a new BIOS and swapping the processors themselves. Server equipment manufacturers greeted AMD’s new products with enthusiasm, and the company’s share of the server market will apparently continue to grow.

Of the trump cards remaining in Intel’s hands at the moment, only the dedicated accelerator blocks and support for the AMX matrix math extensions can be named. In some scenarios, such as telecommunications servers, this will help Xeon hold its positions, but in most other workloads EPYC Turin looks much better. Moreover, the new AMD server processors are noticeably cheaper than Intel’s solutions.
