The smaller the process node a chip is manufactured on, the higher its density of operations: the number of computational/logical operations that a unit of its area (conventionally, a square millimeter) can perform. For the calculations behind generative AI models, it would seem that this is all that is needed: miniaturize the manufacturing process regardless of cost, obtain a faster chip, launch the next model on it… and suddenly discover that after another node transition the speed of standard operations has grown by 10-15%, while the parameter count of the AI model has grown by more than an order of magnitude. The mismatch is obvious, and something has to be done about it.

Dependence of the characteristic area of a single memory cell, in conventional units f², on the process node used for its production. In addition to SRAM, data are given for two types of magnetoresistive memory: the currently manufactured STT-MRAM (spin-transfer torque MRAM, based on the spin-transfer torque effect) and the promising SOT-MRAM (spin-orbit torque MRAM, based on spin-orbit torque) (source: Objective Analysis)

⇡#Memory Catch-up

Let’s start with the fact that it is much harder to shrink the geometric dimensions of semiconductor memory cells than of logic circuits built from transistors – we have already discussed this in detail. Modern microprocessors rely heavily on multi-level data caches to speed up their work, and this on-chip storage of the most frequently accessed information, in the form of SRAM cells, is placed as close as possible to the logic circuits; ideally, right on the same silicon die. As a result, with each new cycle of process miniaturization, the memory cells shrink along with those circuits. It would seem that no problems should arise here: SRAM, unlike DRAM, contains no capacitors, whose physical dimensions are immeasurably harder to reduce than the length, width, and even thickness of the semiconductor elements of transistors – we have written about this before. But if only everything were so simple in microelectronics!

Let us recall that the classic elementary cell of dynamic random access memory (DRAM) is formed by a capacitor paired with a transistor (the one-transistor, one-capacitor scheme, 1T1C), thanks to which it can both take on and give up an electric charge fairly quickly. At the heart of static random access memory (SRAM) lies a transistor circuit with positive feedback. Because the charge constantly circulates through a loop of cross-coupled inverters and two access transistors, write and read speeds are even higher than those of DRAM. True, this comes at a considerable price: a typical SRAM cell today is a six-transistor design (6T), i.e. quite large; in addition, a power failure immediately destroys the data bit stored in the cell.
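
To make the contrast concrete, here is a deliberately toy Python sketch of the two cell types just described: a DRAM bit as a leaking capacitor that needs periodic refresh, and an SRAM bit as a powered latch that loses its contents the moment the supply disappears. All constants (leakage rate, sense threshold) are invented purely for illustration and belong to no real process node.

```python
class DramCell:
    """1T1C: one transistor, one capacitor; the charge leaks over time."""
    LEAK_PER_MS = 0.02          # hypothetical fractional charge loss per ms
    THRESHOLD = 0.5             # sense amplifier decision point

    def __init__(self):
        self.charge = 0.0

    def write(self, bit: int):
        self.charge = 1.0 if bit else 0.0

    def tick(self, ms: float):
        self.charge *= (1.0 - self.LEAK_PER_MS) ** ms   # natural leakage

    def refresh(self):
        self.write(1 if self.charge >= self.THRESHOLD else 0)

    def read(self) -> int:
        return 1 if self.charge >= self.THRESHOLD else 0


class SramCell:
    """6T: cross-coupled inverters; each side is the logical NOT of the other."""
    def __init__(self, powered: bool = True):
        self.powered = powered
        self.q = 0              # stored bit; q_bar is implicitly 'not q'

    def write(self, bit: int):
        if self.powered:        # the access transistors overdrive the latch
            self.q = bit

    def read(self):
        return self.q if self.powered else None   # power loss destroys the bit


cell = DramCell()
cell.write(1)
cell.tick(64)                   # 64 ms without refresh...
print(cell.read())              # ...and the bit may already read back as 0
```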

A typical six-transistor SRAM cell: the bit lines (BL and its complement, BL with an overbar) are used to write and read data; the WL line controls the access transistors. Two CMOS inverters, each formed by a complementary pair of transistors, are cross-coupled: the output potential Vout of one inverter feeds the input Vin of the other, and vice versa; thus the data bit continuously “runs in a circle” (more precisely, in a figure eight) inside this circuit – as long as there is power (source: Wikimedia Commons)

It can be said that the information carrier in DRAM is charge (or rather, its specific value stored in the capacitor – recall today's widespread multi-level memory cells), while in SRAM it is the voltage level circulating in the six-transistor circuit. Since the specific voltage value is determined by the parameters of the current supplied to the circuit, and while that current flows the voltage does not change, such memory is called static. In dynamic memory, DRAM, the charge that naturally leaks from the capacitors must be replenished from time to time (hence the dynamics). Typical access times are under 10 ns for an SRAM cell versus 60 ns or more for DRAM. It is not surprising, then, that performance on AI tasks depends almost more on the amount of SRAM located in close proximity to the computing cores of server graphics accelerators than on the amount of video memory, with which the cores communicate over a noticeably narrower bus and at higher latency (strictly speaking, the logic circuits themselves also have register memory, but it holds only a few dozen, at most a few hundred, bytes). Thus, when publishing the specifications of the H100 accelerator, the official Nvidia website proudly noted that the combined capacity of the L1 cache and local memory shared by the cores of each of its multiprocessors reached as much as 256 KB – 1.33 times more than in its predecessor, the A100.
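
A rough, roofline-style back-of-envelope calculation shows why the placement of data matters more than raw compute here. The peak-compute and bandwidth figures below are generic assumptions for an abstract accelerator, not the specifications of the H100 or any other real device:

```python
# A kernel is limited either by compute or by data movement, whichever
# takes longer. All figures are rough, illustrative assumptions.
FLOPS_PEAK = 100e12   # hypothetical peak compute, FLOP/s
BW_ONCHIP  = 20e12    # assumed aggregate on-chip SRAM bandwidth, bytes/s
BW_DRAM    = 2e12     # assumed DRAM/HBM bandwidth, bytes/s

def time_bound(flops: float, bytes_moved: float, bw: float) -> float:
    """Lower bound on kernel time: max of compute time and transfer time."""
    return max(flops / FLOPS_PEAK, bytes_moved / bw)

# One step of a large matrix multiply: say 1 TFLOP of work that must
# stream 100 GB of operands (low arithmetic intensity, typical of inference).
work, traffic = 1e12, 100e9
print(f"operands in SRAM: {time_bound(work, traffic, BW_ONCHIP) * 1e3:.1f} ms")
print(f"operands in DRAM: {time_bound(work, traffic, BW_DRAM) * 1e3:.1f} ms")
# operands in SRAM: 10.0 ms
# operands in DRAM: 50.0 ms -- the same arithmetic, five times slower
```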

But why, one might reasonably ask, are we then talking about hundreds of kilobytes for a whole huge server accelerator rather than at least tens or hundreds of megabytes? The point is that, for all its undeniable advantages, SRAM has a couple of drawbacks that are extremely unpleasant for AI applications. The first is sheer cost: allocating six transistors for each bit of data means subtracting exactly that many from the budget available for forming logic circuits on the same die. In addition, since the logical one written into such a cell continuously “runs in a circle”, drawing energy from the power supply, the heat dissipation of the fragment of the die occupied by SRAM is generally higher than that of the rest of its surface, where the logic circuits sit – activated often, but still only from time to time. Uneven heating complicates effective heat removal and leads to faster wear of static memory cells compared with the neighboring logic.

Secondly, the miniaturization of semiconductor structures has for several decades now proceeded in accordance with a revised Dennard scaling, which we have already discussed in detail: the operating voltage of a transistor is fixed at about 1 V, which in turn limits how far the geometric dimensions of the transistor channel can shrink. An SRAM cell is quite capable of operating at a reduced voltage and with a thinner insulating layer separating the gate from the channel, and would even become more efficient; but since it is assembled from exactly the same elements as the logic next to it, its excessive energy consumption has to be tolerated. Moreover, as process nodes shrink, the metal interconnects that link semiconductor elements on the silicon substrate also get thinner. For them, thinning translates into higher resistance and greater thermal losses as current passes through (because the mean free path of electrons first becomes comparable to half the conductor's thickness and then exceeds it). And the more intensively the chip's logic accesses SRAM – and in AI workloads, let us recall, this traffic is extremely dense – the worse such a microprocessor with a fast (and large!) cache will perform in practice as a whole.
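
The effect on interconnects can be estimated with a crude sketch: the classical R = ρL/A is compounded by a size-effect rise in resistivity once the electron mean free path (about 39 nm in copper) becomes comparable to the wire's dimensions. The correction term below is a heavily simplified, Fuchs–Sondheimer-style approximation, used only to show the trend, not to model any real metal stack:

```python
RHO_CU_BULK = 1.7e-8    # bulk copper resistivity, ohm*m
MFP_CU      = 39e-9     # electron mean free path in copper, ~39 nm

def wire_resistance(width_nm: float, length_um: float = 10.0) -> float:
    """Resistance of a square-cross-section wire with a crude size-effect term."""
    w = width_nm * 1e-9
    area = w * w                              # assume a square cross-section
    # effective resistivity rises roughly as (1 + lambda/w) at small widths
    rho_eff = RHO_CU_BULK * (1.0 + MFP_CU / w)
    return rho_eff * (length_um * 1e-6) / area

for width in (100, 40, 20, 10):
    print(f"{width:>3} nm wide: {wire_resistance(width):8.1f} ohm")
# Halving the width more than quadruples the resistance: the geometric 1/A
# growth is compounded by the size-effect rise in resistivity.
```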

Surface of a pre-production Intel 14th Gen Meteor Lake sample showing the main components of the processor; as you can see, the SRAM caches of various levels account for a significant share of its surface area (source: Le Comptoir Du Hardware)

⇡#Connections matter

“The biggest problem with machine learning is memory management, not computation,” says Steve Roddy, chief marketing officer at Quadric, and experts who work closely with AI agree wholeheartedly. Since SRAM scales to smaller process nodes worse than logic does, it makes sense to move fast memory into a kind of intermediate layer between computing chips equipped with a modest cache and the main DRAM array; but in that case a bus specialized for high-speed exchange of large data streams is required. This is precisely the role envisioned for the relatively new open interconnect standard Compute Express Link (CXL), built as an evolution of PCIe and intended for exchanging information between central processors, graphics processors, and RAM, including at the level of virtual machines.

In the long term, CXL should make it possible to radically change the current state of affairs, in which the lion's share of the processor die is occupied by SRAM cells – and still this memory is not enough to multiply huge matrices (to which, in essence, all of generative AI boils down) entirely on the spot, right in this fast memory, with minimal delays in exchanging data with the computing cores. Alas, designers of graphics accelerators still have to offload part of the data to much slower DRAM. With an effective CXL bus, a system could be built from separate arithmetic-logic dies (perhaps with minimal first-level SRAM caches on the same silicon), separate, more capacious intermediate SRAM modules, and separate DRAM blocks for computations that tolerate longer delays; AI processing would then go much faster. AMD offers its own approach to the problem, now applying its 3D V-Cache chiplet technology to consumer-grade CPUs, not just server ones: a separate cache-memory die is mounted above or below the plane of the processor die itself. True, this is an expensive pleasure, and it noticeably lengthens processor manufacturing time, but if the goal is to substantially speed up inference (and especially training) of AI models, such sacrifices are reasonable.
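
Why capacious fast memory speeds up matrix multiplication can be shown with the classic cache-blocking technique. The sketch below uses NumPy and the CPU cache as a stand-in for an accelerator's SRAM, but the reuse logic is the same: pick a tile size so that the working set fits in fast memory, and reuse each tile many times before touching slow memory again.

```python
import numpy as np

def blocked_matmul(a: np.ndarray, b: np.ndarray, tile: int = 64) -> np.ndarray:
    """Tiled matrix multiply: process (tile x tile) blocks that fit in fast memory."""
    n, k = a.shape
    k2, m = b.shape
    assert k == k2, "inner dimensions must match"
    c = np.zeros((n, m), dtype=a.dtype)
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            for p in range(0, k, tile):
                # each block is loaded once and reused across a whole tile of
                # output -- the data stays in fast memory while it is hot
                c[i:i+tile, j:j+tile] += (
                    a[i:i+tile, p:p+tile] @ b[p:p+tile, j:j+tile]
                )
    return c

a = np.random.rand(256, 256).astype(np.float32)
b = np.random.rand(256, 256).astype(np.float32)
assert np.allclose(blocked_matmul(a, b), a @ b, atol=1e-3)
```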

Implementing 3D V-Cache relies on chiplet technology (source: AMD)

An additional challenge engineers face in their efforts to miniaturize SRAM cells comes from single-event upsets (SEUs). A “single event” is traditionally defined as the interaction of a microelectronic circuit with a high-energy charged particle, most often an electron from a cosmic-ray shower, although SEUs can also be induced by short-wavelength photons and fast neutrons. The energy of a particle striking a semiconductor is transferred to the material, and in a memory cell this can change its logical state (a bit flip): a transition from the original “1” to “0” or vice versa. The smaller the elements that form the SRAM, the less energy a six-transistor cell stores, and the more likely it is that yet another SEU (and cosmic rays are not rare even at the Earth's surface) will produce a faulty bit in a given region of cache memory. This is extremely unpleasant because the logic for correcting such errors has to be placed right there, next to the static memory, again taking precious transistors away from the main “body” of the computing chip. Incidentally, while for general-purpose computing an SEU is merely an annoying inconvenience, given the push to shrink embedded chips (used in cars and machinery, medicine, etc.) from the now-familiar “40 nm” and “28 nm” nodes further down the scale, a spontaneous bit flip in the processor cache – and especially in the register memory – of some industrial controller, left unnoticed and uncorrected, threatens to turn into a real disaster.
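
The kind of correction logic in question can be illustrated in miniature with a Hamming(7,4) code, which spends three parity bits per four data bits to locate and repair any single flipped bit; ECC blocks beside real SRAM arrays implement the same idea, just scaled up and in hardware. A minimal sketch:

```python
def hamming74_encode(d: list[int]) -> list[int]:
    """Encode 4 data bits into a 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4              # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4              # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4              # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(c: list[int]) -> list[int]:
    """Locate a single bit flip via the syndrome, fix it, return the data bits."""
    p1, p2, d1, p3, d2, d3, d4 = c
    s1 = p1 ^ d1 ^ d2 ^ d4
    s2 = p2 ^ d1 ^ d3 ^ d4
    s3 = p3 ^ d2 ^ d3 ^ d4
    syndrome = s1 + 2 * s2 + 4 * s3    # 0 = no error, else the bad position
    if syndrome:
        c[syndrome - 1] ^= 1           # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]]

word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1                           # simulate an SEU: flip one stored bit
assert hamming74_correct(word) == [1, 0, 1, 1]
```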

Flex Logix CEO Geoffrey Tate notes that while in previous decades special radiation-hardened chips were created only for aerospace applications (the probability of an SEU grows as the layer of Earth's atmosphere above the semiconductor device, which very effectively attenuates cosmic rays, gets thinner), starting with TSMC's N5 production node (marketed as “5 nm”), the impact of charged particles on SRAM cells has proven too great to ignore. Even so, applying error correction – thereby reducing the useful die area (the part used to form the main logic) and adding certain delays to computation – is still more cost-effective than equipping ordinary CPUs and GPUs for terrestrial applications with radiation hardening, since that procedure immediately adds 25 to 50% to their cost. “Perhaps in ten years we will still have to shield even the most general-purpose chips from radiation,” Mr. Tate lamented in 2024, “because it will no longer be possible to shrink memory cells indiscriminately: there is no way to get rid of the influence of charged particles.”

Chips that are resistant to radiation (or more precisely, to the effects of cosmic ray particles up to certain energies) are outwardly little different from those intended for the mass market, and in the future, perhaps, these small differences will be erased altogether (source: Intersil)

Microelectronics engineers, of course, are not giving up; even in the absence of an urgent need to protect processors in PCs and servers on the Earth's surface from cosmic radiation, they keep looking for ways to optimize SRAM operation. For example: since static memory cells are more sensitive to a reduction in operating voltage than logic circuits built from exactly the same transistors, it seems logical to supply different voltages to the processor logic and to the SRAM integrated on the same die; this is the “dual power rail” approach. Another option is high-density (HD) SRAM: instead of placing all six transistors of a 6T cell side by side on a plane, they can be physically implemented on a single fin, if we are talking about FinFETs; in this case, the cell occupies 30-45% less area. Engineers keep coming up with new ideas simply because the technology of integrating SRAM with logic blocks on a single die is so well-established and cost-effective that almost any improvement to it, even the most sophisticated, turns out to be cheaper and faster to put into practice than a transition to other types of memory.
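
The arithmetic behind the dual-rail idea fits in a few lines: dynamic power scales roughly with the square of supply voltage (P ≈ αCV²f), so whichever block can tolerate a lower rail saves power quadratically. The voltages below are illustrative assumptions, not tied to any real process:

```python
def dynamic_power(v: float, v_ref: float = 1.0, p_ref: float = 1.0) -> float:
    """Relative dynamic power at supply voltage v (same C, f, and activity)."""
    return p_ref * (v / v_ref) ** 2

shared_rail = 1.0                    # both blocks forced onto one common rail
logic_rail, sram_rail = 0.75, 1.0    # hypothetical split: SRAM keeps its Vmin

saving = 1.0 - dynamic_power(logic_rail, v_ref=shared_rail)
print(f"logic at {logic_rail} V draws {dynamic_power(logic_rail):.0%} "
      f"of its shared-rail power (a {saving:.0%} dynamic saving)")
# logic at 0.75 V draws 56% of its shared-rail power (a 44% dynamic saving)
```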

⇡#There is such a memory!

However, work on a promising replacement for SRAM is also actively under way. For example, if a six-transistor cell can be replaced by a single element, that alone would mean a multi-fold reduction in the chip area allocated per bit, even if such an element turns out to be larger than the standard transistor of a given process node. Physical implementations of single-element random access memory include its magnetoresistive variety (magnetoresistive random access memory, MRAM) and resistive RAM (abbreviated ReRAM or RRAM); in the latter case, it is not the magnetic properties but the electrical resistance of the material that changes under the applied voltage. Alas, both options, although in principle they allow the corresponding elements to be created on a single substrate with logic circuits, do not yet yield memory with speed characteristics comparable to SRAM. Thus, complex multi-level structures (register cells; L1, L2, L3 caches; then DRAM connected, for example, via CXL) remain more economically advantageous, since they still work incomparably faster.
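
Why the hierarchy wins economically is easy to see from an average-memory-access-time (AMAT) estimate: as long as the fast upper levels hit often, the slow DRAM backstop contributes almost nothing to the expected latency. The latencies and hit rates below are generic textbook-style assumptions, not measurements of any real chip:

```python
levels = [                   # (name, latency_ns, hit_rate)
    ("L1 SRAM", 1.0, 0.90),
    ("L2 SRAM", 4.0, 0.95),
    ("L3 SRAM", 12.0, 0.97),
    ("DRAM",    80.0, 1.00), # backstop: always "hits"
]

def amat(hierarchy) -> float:
    """Expected latency when each level is consulted only after all faster ones miss."""
    total, p_reach = 0.0, 1.0
    for _name, latency, hit_rate in hierarchy:
        total += p_reach * latency       # pay this level's latency on arrival
        p_reach *= (1.0 - hit_rate)      # fraction of accesses that go deeper
    return total

print(f"AMAT with the cache hierarchy: {amat(levels):.2f} ns")
print(f"DRAM alone:                    {levels[-1][1]:.2f} ns")
# ~1.5 ns versus 80 ns: the hierarchy hides DRAM almost completely
```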

A graph showing how the rate at which SRAM bit-cell size (vertical axis, in millionths of a square millimeter) shrinks is slowing as process nodes get smaller and smaller (source: WikiChip)

How will engineers cope if, by the time of the transition to the next process node, it turns out that the SRAM cells it produces are simply unfit for purpose; for example, they take up too large a share of the die as it shrinks (i.e., with the same number of transistors used to form logic circuits, on the one hand, and static memory cells, on the other)? This is not an idle question: for an abstract chip in which 60% of all transistors are allocated to logic and 40% to SRAM, the ratio of the areas occupied by the two was approximately 82:18 at TSMC's N16 node (marketed as “16 nm”), already 78:22 at N5, and approached 71:29 at N3. The most obvious solution may be the already mentioned transition to a chiplet layout: logic (together with register memory cells, which are unavoidable in current microarchitectures) on one chiplet, SRAM caches on another, with the static-memory chiplet manufactured on an earlier process node to avoid the burden of problems that grow with miniaturization. Considering, moreover, that the logic portion of a modern microprocessor is itself growing ever larger and is also being split into chiplets by both AMD and Intel, this looks like an entirely reasonable option.
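
Those area ratios can be turned around to show how quickly SRAM's density advantage is eroding. Under the simple model stated above (a fixed 60/40 split of transistors between logic and SRAM), the quoted ratios imply the following relative densities; the inversion itself is our illustrative assumption, not a claim from TSMC:

```python
N_LOGIC, N_SRAM = 0.60, 0.40    # transistor shares from the text

area_ratios = {"N16": (82, 18), "N5": (78, 22), "N3": (71, 29)}

for node, (a_logic, a_sram) in area_ratios.items():
    # a_logic / a_sram = (N_LOGIC / d_logic) / (N_SRAM / d_sram)
    #   =>  d_sram / d_logic = (a_logic / a_sram) * (N_SRAM / N_LOGIC)
    density_edge = (a_logic / a_sram) * (N_SRAM / N_LOGIC)
    print(f"{node}: SRAM packs ~{density_edge:.1f}x more transistors per unit area than logic")
# N16: ~3.0x, N5: ~2.4x, N3: ~1.6x -- SRAM's density edge shrinks with every
# node, which is exactly why its share of die area keeps growing.
```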

However, it is still too early to write off the good old 6T cells. At the end of 2024, when TSMC published more detailed specifications of its upcoming N2 process, it turned out that the SRAM cell area for it is smaller, and the density of stored data correspondingly higher, than independent experts had previously assumed based on analysis of the preceding transition, from N3P to N3X. The point, apparently, is that with N2 the Taiwanese chipmaker has for the first time in its practice adopted gate-all-around nanosheet transistors (nanosheet GAA FETs), which simultaneously reduced power consumption, increased performance, and raised the density of these elementary semiconductor devices to values unprecedented for FinFETs. For comparison: if the first-generation N3 process was characterized by an SRAM density of 33.55 Mbit/mm², the transition to N2 raised this value to 38.00 Mbit/mm²; accordingly, the area of a single static memory cell on the die surface decreased from 0.0199 to 0.0175 millionths of a square millimeter (10⁻⁶ mm²). It is worth noting that when the N3 process replaced the previous “5-nm” N5, the area of a single SRAM cell barely decreased: between 0.0210 and 0.0199 × 10⁻⁶ mm² there is only about a 5% difference.

Nanosheet GAA Transistor: A Salvation for Classic SRAM? (Source: TSMC)

And of course, the further development of AI computing is not hampered by problems with static memory alone. Suffice it to mention another bottleneck on the path to multiplying ever-larger matrices: the steady growth of parasitic capacitance and resistance in the interconnects that link the various hardware components of semiconductor computers. Besides contaminating useful signals with noise (which has to be removed, including by logical means, by adding further levels of error correction), this misfortune results in higher energy consumption and higher latency; that is, once again, it slows down computations that are already becoming excessively cumbersome. Here, too, everything comes down to logic circuits scaling faster than everything else, only this time it is not the elementary SRAM cells lagging behind them but the actual metal data buses connecting the various elements of the microcircuit.
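
A first-order Elmore-style estimate makes the interconnect problem tangible: a wire's signal delay scales with its RC product, and for global buses whose length does not shrink along with the transistors, shrinking the cross-section drives the delay up node after node. The resistivity and per-length capacitance below are coarse, assumed values for illustration only:

```python
RHO = 2.0e-8        # assumed effective resistivity, ohm*m
C_PER_M = 2.0e-10   # assumed wire capacitance per meter, F/m (~0.2 fF/um)

def wire_rc_delay(width_nm: float, length_mm: float) -> float:
    """Distributed-RC (Elmore-style) delay estimate: 0.5 * R * C."""
    area = (width_nm * 1e-9) ** 2          # assume a square cross-section
    length = length_mm * 1e-3
    r = RHO * length / area                # total wire resistance
    c = C_PER_M * length                   # total wire capacitance
    return 0.5 * r * c

for w in (100, 50, 25):
    print(f"{w:>3} nm wire, 1 mm long: {wire_rc_delay(w, 1.0) * 1e12:.0f} ps")
# 200, 800, 3200 ps: quartering the cross-section quadruples the delay of a
# fixed-length wire -- parasitics, not transistor speed, set the pace.
```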

In short, leading experts in the microelectronics industry today doubt that brute-force growth in transistor density on semiconductor chips will bring AI computing – precisely because of its particular demands on the speed of fairly simple operations – to the level desired for achieving, say, AGI (“strong” AI). If today up to two-thirds of the considerable energy consumed by generative models during inference is spent on moving data between processor cores and the memory subsystem, it is quite possible that a qualitative breakthrough in AI will come from technologies other than classical semiconductor von Neumann machines: neuromorphic computers, analog computers, photonics, or something else entirely. In any case, as long as humanity does not skimp on AI research, all options will be explored in parallel, including widening the bottlenecks of the classical von Neumann approach. After all, so much effort and money has already been invested in it that you can't just drop it and walk away!
