The era in which Intel lagged behind on server core counts, ushered in by the debut of AMD EPYC 7002 Rome five years ago, is gradually coming to an end. Back in the summer, the company introduced the 144-core Xeon 6700E (Sierra Forest) chips. Sierra Forest could still be dismissed, with some reservations, as not quite a “real” Xeon: it uses only energy-efficient E-cores without SMT and is positioned against the 128-core AMD EPYC Bergamo, built on Zen 4c with SMT. With the announcement of the Xeon 6900P, that argument no longer works. Meet Granite Rapids-AP with 128 full-fledged P-cores!
Much about the architecture and capabilities of Granite Rapids was already known when Sierra Forest was announced, but now we are talking about a full-fledged family that launches with five processor models offering 72 to 128 cores. In core count, the flagship matches even the upcoming 128-core AMD EPYC Turin based on Zen 5, whose announcement is imminent.
By relying on complex, high-performance cores, Intel can now claim a 2.1x performance advantage over the 96-core flagship AMD EPYC Genoa in OpenFOAM workloads. In inference workloads, thanks to AMX and support for AI-oriented data formats such as FP16, the claimed advantage in ResNet50 scenarios reaches 5.5x.
Intel takes a very radical approach to the “Great Divide”, clearly separating the optimal application areas for processors with P-cores and E-cores. From a software point of view the difference is not that large: it essentially comes down to the E-cores lacking full AVX-512 and AMX support, which is partly offset by an advanced AVX implementation. Still, the company does not plan to release Xeons with a mixed configuration in the foreseeable future. Note also that P-cores are the better fit where maximum per-thread performance is required, which is especially important for applications licensed per core.
The new Xeon 6900P series features Redwood Cove cores with SMT (HT) support, which yields 512 threads in a dual-processor system, and a 12-channel memory controller designed to match and surpass AMD’s memory subsystems. It works with DDR5-6400 modules, and MRDIMM support raises the speed to 8800 MT/s.
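To put these figures in perspective, here is a rough back-of-the-envelope sketch of the thread count and the theoretical peak memory bandwidth per socket. It assumes a 64-bit (8-byte) data path per DDR5 channel and ignores real-world efficiency losses; the function and variable names are illustrative only, and the results are upper bounds rather than measured numbers.

```python
def peak_bandwidth_gbs(channels: int, mega_transfers: int,
                       bytes_per_transfer: int = 8) -> float:
    """Theoretical peak bandwidth in GB/s (decimal units).

    Assumption: each DDR5 channel moves 8 bytes of data per transfer.
    """
    return channels * mega_transfers * bytes_per_transfer / 1000


CORES_PER_SOCKET = 128
SOCKETS = 2
THREADS_PER_CORE = 2  # SMT (HT)

threads = CORES_PER_SOCKET * SOCKETS * THREADS_PER_CORE

ddr5 = peak_bandwidth_gbs(channels=12, mega_transfers=6400)
mrdimm = peak_bandwidth_gbs(channels=12, mega_transfers=8800)

print(f"Threads in a dual-socket system: {threads}")
print(f"DDR5-6400:   {ddr5:.1f} GB/s per socket, "
      f"{ddr5 / CORES_PER_SOCKET:.2f} GB/s per core")
print(f"MRDIMM-8800: {mrdimm:.1f} GB/s per socket, "
      f"{mrdimm / CORES_PER_SOCKET:.2f} GB/s per core")
```

Under these assumptions, moving from DDR5-6400 to MRDIMM-8800 raises the theoretical ceiling by 37.5%, which is simply the ratio of the two transfer rates.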
Each core includes two execution units for the AVX-512 instruction set. The L1 cache consists of 64 KB for instructions and 48 KB for data. The decoder handles 8 instructions per clock cycle, and the pipeline sustains the same throughput in micro-operations. The out-of-order execution engine is twice as deep as the one in Sierra Forest and can hold 512 instructions in flight.
Cache capacity is also impressive: each Redwood Cove core has its own 2 MB L2 cache, and the shared L3 cache reaches 504 MB; only AMD EPYC Genoa-X with 3D V-Cache can offer more. In a dual-processor configuration this adds up to over 1 GB of cache, which will certainly be useful in HPC and AI scenarios. Each Xeon 6900P processor has 96 PCI Express 5.0 lanes with CXL 2.0 support covering all device types (Type 1, 2 and 3). Six UPI links running at 24 GT/s handle interprocessor communication.
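The cache totals are easy to check. The short sketch below uses only the figures quoted above for the 128-core flagship; the text does not say whether “over 1 GB” counts L2, so both sums are shown. Variable names are illustrative.

```python
CORES = 128            # flagship Xeon 6900P model
L2_PER_CORE_MB = 2     # private L2 per Redwood Cove core
L3_SHARED_MB = 504     # shared L3 per socket
SOCKETS = 2

l2_total_mb = CORES * L2_PER_CORE_MB          # all private L2 in one socket
per_socket_mb = l2_total_mb + L3_SHARED_MB    # L2 + L3 in one socket
dual_l3_mb = SOCKETS * L3_SHARED_MB           # L3 only, two sockets
dual_total_mb = SOCKETS * per_socket_mb       # L2 + L3, two sockets

print(f"L2 per socket:        {l2_total_mb} MB")
print(f"L2 + L3 per socket:   {per_socket_mb} MB")
print(f"L3 in two sockets:    {dual_l3_mb} MB")
print(f"L2 + L3, two sockets: {dual_total_mb} MB")
```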
The processors use the so-called “big socket”, LGA 7529. Xeon 6 with P-cores will later be released for the “small socket”, LGA 4710, currently used by Sierra Forest chips, but not before the first quarter of 2025. Those will be parts with fewer cores (up to 86) and an eight-channel memory subsystem. E-core parts for LGA 7529 are expected in the same timeframe, but for now Sierra Forest and Granite Rapids each exist only on their own platform.
The Xeon 6900P is, of course, built from chiplets, as was already covered in the Xeon 6700 review. Now we know more, though: at the launch presentation Intel showed that P-core chiplets will come in two sizes, LCC (low core count), limited to 16 cores, and the larger HCC (high core count), with up to 48 cores. The XCC (eXtreme core count) configuration is assembled from two or three HCC chiplets. It is the three-tile variant that all five Xeon 6900P models use.
In effect, the Xeon 6900P can be treated as a NUMA system with three nodes. Because the DDR5 controllers sit in the compute chiplets rather than in the I/O chiplets, as in AMD’s design, support for different clustering modes can in some scenarios reduce memory access latency and therefore improve performance. The Xeon 6900P has two such modes, HEX and SNC3. The latter implements the “each chiplet accesses its own DDR controllers” scenario.
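On Linux, the clustering mode chosen in firmware shows up as the NUMA topology the kernel reports. The following is a minimal sketch, assuming the standard sysfs layout under `/sys/devices/system/node`, that lists which logical CPUs belong to each node. The helper names are made up for illustration, the parser handles only plain ranges and single IDs, and how CPUs are actually numbered on a given Xeon 6900P system is not something the code assumes.

```python
from pathlib import Path


def parse_cpulist(text: str) -> list[int]:
    """Expand a kernel cpulist string such as '0-2,8,10-11' into CPU IDs."""
    cpus: list[int] = []
    for part in text.strip().split(","):
        if not part:
            continue  # empty list, e.g. a memory-only node
        if "-" in part:
            low, high = part.split("-")
            cpus.extend(range(int(low), int(high) + 1))
        else:
            cpus.append(int(part))
    return cpus


def numa_nodes(root: str = "/sys/devices/system/node") -> dict[int, list[int]]:
    """Map NUMA node ID to its logical CPUs; empty dict if sysfs is absent."""
    nodes: dict[int, list[int]] = {}
    for node_dir in sorted(Path(root).glob("node[0-9]*")):
        cpulist = node_dir / "cpulist"
        if cpulist.is_file():
            node_id = int(node_dir.name[len("node"):])
            nodes[node_id] = parse_cpulist(cpulist.read_text())
    return nodes


if __name__ == "__main__":
    for node_id, cpus in sorted(numa_nodes().items()):
        print(f"node {node_id}: {len(cpus)} logical CPUs")
```

With SNC3 enabled one would expect each socket to appear as three nodes rather than one, which a NUMA-aware application can use to keep threads close to the memory controllers of their own chiplet.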
Interestingly, the configuration uses two tiles with 43 active cores each, while the third has only 42. This was done, of course, to arrive at 128 cores, but hypothetically nothing prevented Intel from enabling more cores and producing a chip that beats EPYC on core count. The reason, presumably, is the power appetite of the new parts, which already draw half a kilowatt even with 128 cores.
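The 43/43/42 split is simply what an even division of 128 cores across three tiles gives. The sketch below takes the 48-core-per-chiplet ceiling mentioned earlier at face value to estimate how many cores would remain unused; that headroom figure is an assumption derived from the article’s numbers, not a confirmed die specification.

```python
TILES = 3
MAX_CORES_PER_TILE = 48   # assumption: HCC ceiling quoted in the article
ACTIVE_TOTAL = 128

# Spread the active cores as evenly as possible across the tiles.
base, extra = divmod(ACTIVE_TOTAL, TILES)
active_per_tile = [base + 1 if i < extra else base for i in range(TILES)]

physical_total = TILES * MAX_CORES_PER_TILE
disabled = physical_total - ACTIVE_TOTAL

print(f"Active cores per tile: {active_per_tile}")
print(f"Hypothetical maximum:  {physical_total} cores")
print(f"Left disabled:         {disabled} cores")
```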
The Xeon 6900P has a distinctive inter-core latency pattern. Communication between cores on the two outermost chiplets naturally has higher latency. We can also confidently expect slightly higher latencies when the middle chiplet works with external devices, since the I/O chiplets sit literally at the edges of the CPU.
As has been noted repeatedly, the chips are closer to a monolithic design than AMD’s: the chiplets are stitched together via EMIB and share a universal modular mesh network. The compute tiles are manufactured on the Intel 3 process, while the I/O chiplets use the cheaper and more widely used Intel 7 process. Unlike the comparable die in AMD EPYC, the I/O chiplets are responsible only for the UPI/PCIe/CXL interfaces and for the DSA/IAA/QAT/DLB accelerators.
The new lineup opens with the 72-core Xeon 6960P, which can be called Intel’s answer to NVIDIA Grace (72 cores in the GH200, 144 cores in the Grace Superchip). Everything has its price, however, and here the price is a TDP raised to 500 W. Only the 6952P, with its low base frequency, has a slightly lower TDP of 400 W. Otherwise the 6900P models are very similar; they even share the same turbo frequencies of 3.9 GHz for a single core and 3.7 GHz for all cores. For now, that is higher than AMD EPYC Genoa parts with a similar core count. All variants have four active accelerators of each type, 12 memory channels, six UPI links and 96 PCIe 5.0 lanes.
Memory deserves a separate mention. The number of channels and memory capacity are growing much more slowly than the number of cores. This is especially acute given the boom in AI technologies, which are by nature very memory-hungry. Speed is also a problem: although the move to DDR5 has eased it somewhat, even Sierra Forest supports only DDR5-6400. Granite Rapids-AP supports that memory too, but its ability to work with MRDIMM-8800 modules looks more promising.
Externally, MRDIMMs (and MCR DIMMs) resemble conventional registered DDR5 modules, but an additional buffer lets the new modules read from two ranks at once, instead of one as in conventional DIMMs. It is worth noting that JEDEC is still only planning to publish the MRDIMM specification. For now Intel has the advantage, but it is too early to talk about mass production of the new memory type. It is also possible that AMD’s implementation of support for this memory will differ.
As for overall performance, the first results are already in: in tests conducted by Phoronix, a platform with two flagship Xeon 6980P processors confidently took first place, ahead of dual-processor systems with AMD EPYC 9754, 9684X and 9654.
According to Intel’s own figures, sixth-generation Xeon outperforms the fifth generation (Emerald Rapids) by 2-3x in raw performance and by 1.44x to 2.16x in performance per watt, depending on the type and nature of the load. Notably, despite the higher TDP, the energy efficiency of the Xeon 6900P has also improved: at a typical 40% server load, the new processors deliver 1.9 times more performance per watt than the previous Xeon generation.
Compared with AMD parts of comparable core count (EPYC 9754 and 9654, with 128 and 96 cores respectively), the Xeon 6980P is 1.2x to 3.21x faster, with the smallest advantage in integer workloads and a noticeably larger one in floating-point calculations and memory subsystem tests.
Let’s sum up the interim results. AMD originally pulled ahead in multi-core server processors precisely thanks to its move to a chiplet layout. The triumph of the “reds” was helped in large part by Intel’s painfully protracted rollout of its 10 nm process. As a result, for five whole years EPYC steadily increased its core counts, while the “blues”, stubbornly clinging to a monolithic die, were also constrained by the limits of their process technology. This kept Intel from confidently crossing the 64-core mark.
Intel only implemented a chiplet layout with Sapphire Rapids. Although it used lower-level chiplet stitching from the very beginning, until recently it relied on a not very successful homogeneous approach with large universal chiplets. Only in the Xeon 6 generation did a heterogeneous design debut, with the I/O part moved into separate chiplets, which, combined with newly mastered advanced process technologies, allowed Intel to make a breakthrough and narrow the gap with AMD.
Intel now has not only a trump card in the form of a high-performance architecture and specialized accelerator units, but also simply “large” Xeons that can compete with EPYC on equal terms in core count. The situation is all the more interesting because on October 10 AMD is preparing to officially present the new EPYC Turin based on Zen 5, which will also get 128 cores. AMD is preparing not only those, however, but also the 192-core EPYC Turin Dense based on Zen 5c. And as we know, sometimes quantity trumps quality.