Neural network photonics as a remedy for the AI energy crisis

The mention of nuclear reactors is not a figure of speech: faced with unabated demand for cloud-hosted generative models, providers of such services, Microsoft first among them, have already begun looking at small modular reactors (up to 300 MW of electrical output) to power their future data centers, and AWS recently acquired land for a new, primarily AI-focused data center right next to a nuclear power plant in Pennsylvania. According to the International Energy Agency, back in 2022, at the very dawn of the AI boom, all the world's data centers (excluding those engaged in cryptocurrency mining) consumed 240-340 TWh of electricity per year, or 1.0-1.3% of total global generation, with crypto-mining operations adding at least another 0.4% on top. Energy experts quoted by Scientific American estimate that at least one and a half million AI servers built on NVIDIA accelerators alone will be deployed worldwide by 2027, and that their energy consumption alone will exceed 85 TWh annually. That is why the talk is specifically of nuclear reactors: today there is simply no alternative to them when it comes to generating that proverbial kilowatt-hour efficiently (on the common assumption that practical fusion power will remain untamed for the next 5-10 years).

Solar farms have a carbon footprint (grams of CO2 emitted into the atmosphere per kWh of electricity produced) roughly four times that of a typical nuclear power plant (source: IPCC)

Moreover, the forecast made at the end of 2023 of 85 TWh of annual AI energy consumption may in fact prove an over-optimistic lower bound: according to Bloomberg, data-center operators in Northern Virginia are already looking to reserve several gigawatts of nuclear reactor capacity in the state, and the roster of startups backed by Sam Altman, head of the notorious OpenAI, has recently been joined by Oklo, a developer of low-power nuclear reactors, each of which is well suited to powering a mid-sized AI data center. An interesting detail: Oklo representatives candidly told NBC News that coordinating projects with regulators and collecting all the required permits takes far more effort than finding prospective clients, among whom AI-focused hyperscalers are far from the last in line. In short, although the shortage of NVIDIA server AI accelerators promises to be resolved soon, the rapid growth in the use of "heavy" generative models (unsuitable for running, let alone training, on home hardware) threatens in the medium term to run up against an acute shortage of electricity.

Or is it not such a threat after all?

⇡#A useful doctrine

The energy efficiency of the human brain is the envy of microelectronics engineers: to deliver roughly 1 exaflops (10^18 operations per second), a modern supercomputer of the Oak Ridge Frontier class has to burn up to 20 MW, whereas the brain of an average chess grandmaster pondering a difficult position (to win which, the supercomputer would have to squeeze out, if not the whole of that exaflops, then a noticeable share of it) draws about twenty watts. Yes, as semiconductor process nodes shrink, chip energy efficiency improves, but, alas, nowhere near fast enough to close a gap of several orders of magnitude.

A supercomputer, of course, solves a whole range of problems far more efficiently than the most brilliant human: namely, exact computations that lend themselves to clean algorithms. The trouble begins where algorithmization stops being the optimal strategy. Technically, the same game of chess can at any stage be reduced to a straightforward enumeration of every move available in the current position, plus all the secondary (and subsequent) variations each candidate move spawns. The task is perfectly algorithmic; the only problem is that the number of positions branching at every step grows so fast that von Neumann computers burn enormous amounts of energy working through them. The biological brain is organized differently and relies not so much on exact calculation (try computing π raised to the power e in your head to even FP8 precision, while for a computer this is barely harder than 3.0 to the power 2.0) as on qualitative, generalizing judgments and heuristics. And it is precisely for building computing systems that rely on similar biological principles that photonics looks like an exceptionally promising direction.
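
To get a feel for why brute-force enumeration is so costly, here is a minimal sketch; the average branching factor of about 35 moves per chess position and the per-evaluation energy cost are illustrative assumptions, not measured values:

```python
# Rough illustration of the game-tree explosion: with ~35 legal moves per
# position on average, the number of positions grows as 35 ** depth.
AVG_BRANCHING = 35.0        # commonly cited average for chess (illustrative)
ENERGY_PER_EVAL_J = 1e-9    # assumed cost of evaluating one position (illustrative)

for depth in (4, 8, 12, 16):
    positions = AVG_BRANCHING ** depth
    energy_kwh = positions * ENERGY_PER_EVAL_J / 3.6e6
    print(f"depth {depth:2d}: {positions:.2e} positions, ~{energy_kwh:.2e} kWh to enumerate")
```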

Drawing of different types of neurons in the chick cerebellum and their interconnections by the Spanish neuroanatomist Santiago Ramón y Cajal, published in 1905 (source: Wikimedia Commons)

Back at the end of the 19th century, when optical microscopes, imperfect by today's standards, still did not allow researchers to clearly resolve the structure of nervous tissue, the neuron doctrine took hold in science: the idea of an isolated, specialized cell, the neuron, as the basic structural and functional unit of the nervous system. Today we know that the human brain consists of 80-120 billion neurons of many varieties, connected to one another by rather intricately organized contacts called synapses. Historically this was far from obvious: just over a hundred years ago a number of serious researchers, including Camillo Golgi, inventor of the "black reaction" staining method, opposed the neuron doctrine with the reticular hypothesis, the assumption that the network of connections in the brain's gray matter is continuous, and that the nodule-like neurons within it, thanks precisely to this uninterrupted transport structure, directly exchange certain "plasmatic substances," to which all nervous activity supposedly comes down.

The neuron doctrine implies a far more complex, indirect organization of excitation transfer between individual neurons, mediated by synapses. Each neuron, as direct observations confirmed by the middle of the 20th century, carries several protoplasmic projections called dendrites: they receive signals from other neurons, pre-process them and pass them on to the cell body. The cell body, in turn, performs the final processing of the signals arriving from all the dendrites and forms a single outgoing signal. That signal travels along the axon, an elongated fiber whose end branches into numerous collaterals, which form synaptic contacts in the nervous tissue with the dendrites of other neurons.

Schematic structure of a multipolar (i.e., having many dendrites) neuron – these are the ones that form most of the nervous tissue of the brain (source: Wikimedia Commons)

At first glance it may seem that computer neural networks of the now-standard design, built from many layers of perceptrons (multi-layer perceptrons, MLP), implement precisely the reticular hypothesis the neuroscientists rejected: in diagrams showing layers of artificial neurons shot through with mutual connections, there are no intricately organized boundaries between individual nodes. In fact, however, modern LLMs and other machine-learning systems embody exactly the neuron doctrine, since training comes down to forming a particular set of weights at the inputs of each perceptron in every working layer of the network (recall that digital neural networks are implemented as mathematical abstractions emulated on von Neumann machines: a neuron corresponds to a perceptron model with many inputs and outputs, where the former can be read as dendrites and the latter as the collaterals at the end of an axon). And this corresponds exactly to how the nervous system is organized at the level of neurons: as signals pass along dendrites (learning in the broadest sense of the word), so-called dendritic spines form on them, special membrane outgrowths that become the substrate on which synapses are built. During learning the number of spines demonstrably grows, which can be regarded as a direct analogue of adjusting the input weights of perceptrons during machine learning. The reticular hypothesis, by contrast, assumed that the channels between the nodes of the network are purely transport conduits, end to end, and that all data processing happens in the neuron bodies themselves. Incidentally, the Kolmogorov-Arnold Network (KAN) architecture, recently proposed as an alternative to the MLP, moves from linear weights at the perceptron inputs to learnable activation functions on the connections themselves, which brings the artificial neuron system even closer to its natural prototype.
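
A minimal sketch of the difference in plain Python; the edge functions in the second variant are toy stand-ins for KAN's learnable activations, not the actual implementation:

```python
import math

# MLP-style neuron: fixed nonlinearity, learnable scalar weights on the inputs.
def mlp_neuron(inputs, weights, bias):
    s = sum(w * x for w, x in zip(weights, inputs)) + bias
    return math.tanh(s)                      # fixed activation function

# KAN-style neuron (toy): each input edge carries its own learnable 1-D
# function; the node itself merely sums the transformed inputs.
def kan_neuron(inputs, edge_functions):
    return sum(f(x) for f, x in zip(edge_functions, inputs))

x = [0.2, -0.7, 1.3]
print(mlp_neuron(x, weights=[0.5, -1.0, 0.8], bias=0.1))
print(kan_neuron(x, edge_functions=[math.sin, math.tanh, lambda t: t ** 3]))
```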

Fundamental differences between biological and digital neural networks remain, of course, and not only at the level of overall structure (in the brain, neurons are grouped into specialized regions, or zones, for solving different classes of problems, whereas modern neural-network models are in this sense mostly "homogeneous"), but also in how interneuron contacts are organized. Synapses, after all, can be purely electrical, purely chemical or mixed, and the operation of the latter two types is strongly influenced by neuromodulators: chemical substances that enter the environment of the nervous tissue in one way or another and alter how a signal crosses the synaptic cleft. Then again, at least at the current stage of technological development, humanity can perhaps rest a little easier knowing that neural networks, already prone to hallucination, cannot be artificially "invigorated" by injecting digital analogues of such neuromodulating substances.

⇡#But that's not certain

The fundamental advantage of computations performed by von Neumann machines, with their sequential execution of operations, is, as already noted, their accuracy, which in turn is guaranteed by the Boolean algebra underlying the operations they perform. When any variable can take only one of two values, "true" or "false," and the computational procedure is clearly defined algorithmically, there is no difficulty in reaching any given precision: for the notorious number π, say, roughly one hundred trillion digits after the decimal point have been computed. True, computers of the von Neumann architecture have at least three weaknesses: the time required for cumbersome calculations (for some problems it exceeds the estimated age of our Universe), the amount of energy consumed in the course of such calculations (hello again, nuclear power plants!), and the inability to solve complex problems effectively, even approximately, when a clearly defined algorithm is absent. A classic example is the traveling salesman problem: no algorithm of polynomial ("power-law") complexity is known for it, i.e. one whose number of steps grows no faster than n raised to some fixed power, where n is the number of destinations on that same salesman's map, and it is widely believed, though not yet proven, that none exists.
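
A brute-force sketch makes the growth tangible: enumerating every tour of n cities takes (n-1)!/2 route evaluations, which quickly becomes hopeless. The random coordinates below are purely illustrative:

```python
import itertools, math, random

random.seed(0)
cities = [(random.random(), random.random()) for _ in range(9)]  # toy instance

def tour_length(order):
    return sum(math.dist(cities[a], cities[b])
               for a, b in zip(order, order[1:] + order[:1]))

# Exhaustive search: fix city 0 as the start to avoid counting rotations twice.
best = min(itertools.permutations(range(1, len(cities))),
           key=lambda rest: tour_length((0,) + rest))
print("best tour:", (0,) + best, "length:", round(tour_length((0,) + best), 3))

# How the number of distinct tours explodes with the number of cities:
for n in (10, 15, 20, 30):
    print(f"n={n}: {math.factorial(n - 1) / 2:.3e} distinct tours to check")
```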

Circuit diagram of the so-called loop neuron, proposed for implementation in optoelectronic neural networks and closer to the biological prototype than the classical perceptron: N – neuron body, D – dendrites, Se and Si – excitatory and inhibitory synapses, T – transmitter (source: NIST)

Like any tool made by humans, the von Neumann computer has proven well suited to a certain, albeit very wide, range of tasks; beyond that range, the wisdom of using it becomes increasingly questionable. Calculating the trajectory of a space probe launched toward Pluto and on to the objects of the Kuiper Belt, for example, is a job for a computing system operating squarely within the framework of Boolean algebra: the laws of celestial mechanics are flawlessly algorithmized, and if perturbations arise along the way, it is not hard to correct the course in the same manner, with a precisely calculated impulse from the onboard propulsion system.

But a probe, tiny compared with even the scrawniest asteroid, drifting through the cosmic void is acted upon by only a handful of significant forces: the initial impulse from the rocket's upper stage, the gravitational field of the Sun, the pull of one or two nearby planets on each leg of the trajectory; everything else can be neglected with very good accuracy. Alas, problems of this type are, strictly speaking, not that common in nature: far more often one has to account for the interplay of a huge number of comparatively small yet roughly equal forces at once. Say, how does one calculate the shape that an ordinary soap film will take when stretched across a wire frame? There is no uncertainty here: it is well known that surface tension will bend the surface so as to minimize the potential energy those same forces create. But the more complex the frame's shape, the more cumbersome the algorithmic calculation becomes, whereas in the real world the soap film settles into the "desired" shape on a star-shaped polygonal, even non-planar frame just as easily as on a perfectly circular one, without doing any preliminary computation at all.
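
A toy numerical counterpart of that physical "computation": for small deflections, the film's minimal-energy surface over a square frame can be approximated by iterative relaxation, where every interior point repeatedly takes the average of its neighbors. The frame heights below are arbitrary illustrative values:

```python
# Relaxation sketch: approximate a soap film's height over a square frame by
# repeatedly replacing each interior point with the average of its neighbors
# (a discrete minimal-energy, harmonic surface for small deflections).
N = 21
h = [[0.0] * N for _ in range(N)]

# Boundary = the wire frame; these heights are arbitrary, for illustration only.
for i in range(N):
    h[i][0], h[i][N - 1] = 1.0, 0.0         # left edge raised, right edge flat
    h[0][i], h[N - 1][i] = i / (N - 1), 0.5

for _ in range(2000):                        # Jacobi iterations
    new = [row[:] for row in h]
    for i in range(1, N - 1):
        for j in range(1, N - 1):
            new[i][j] = 0.25 * (h[i - 1][j] + h[i + 1][j] + h[i][j - 1] + h[i][j + 1])
    h = new

print("film height at the center of the frame:", round(h[N // 2][N // 2], 4))
```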

The total increase in data-center energy consumption in the United States and worldwide, driven by the rise of AI, will amount to about 200 TWh per year between 2023 and 2030 (source: Goldman Sachs Research)

This is precisely the key to understanding why, alongside von Neumann computers, humanity has also come to need computational neural networks. The latter are by no means an evolutionary development of the former: a huge class of important problems that demand exact solutions by strictly specified algorithms will still be successfully ground through by Boolean-logic systems ten and, presumably, a hundred years from now. But there is another, perhaps even broader class of problems that calls for massively parallel processing of vast amounts of data using machine heuristics, which are not defined algorithmically but are developed, like human intuition, through training; for this category of problems there is simply no getting by without neural-network computers. And it is far more sensible to build them on specialized hardware platforms, such as optoelectronics, than to emulate them over and over on ultra-fast and monstrously power-hungry von Neumann servers. To grasp the scale of the problem: International Energy Agency analysts put the energy cost of processing a single ChatGPT request at 2.9 Wh, whereas handling a user query to a conventional search engine costs Google's data centers, for example, about 0.3 Wh.
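
A back-of-envelope sketch of what that gap means at scale; the daily query volume below is an assumed round figure for illustration, not a sourced statistic:

```python
# What would it cost if every ordinary search became an LLM request?
WH_PER_SEARCH = 0.3      # IEA estimate cited in the text
WH_PER_CHATGPT = 2.9     # IEA estimate cited in the text
QUERIES_PER_DAY = 9e9    # assumed round figure, for illustration only

extra_wh_per_day = QUERIES_PER_DAY * (WH_PER_CHATGPT - WH_PER_SEARCH)
print(f"extra energy: {extra_wh_per_day * 365 / 1e12:.1f} TWh per year")
```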

Strictly speaking, today's familiar semiconductor microcircuits are also built up from elementary units, switches, better known as transistors, from which the efforts of VLSI designers and manufacturers successively assemble the Sheffer strokes (NAND gates) we considered earlier, then adders and ever more complex circuits. But that is exactly the point: even these basic elements obey the laws of Boolean algebra, producing strictly a one or strictly a zero at the output depending on their design and the input values, and are therefore doomed to perform exclusively exact calculations. There is simply no room in them for learning and the heuristics it produces (when organized properly, of course), and the energy efficiency of semiconductor computing is nothing to write home about either. Biological neurons are structured completely differently, so choosing physical systems suited to reproducing the principles of their operation should certainly be more energetically profitable than emulating them with virtual perceptrons in the RAM of a von Neumann machine.
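
For reference, this is the kind of strictly Boolean assembly the paragraph describes: a half-adder composed entirely of NAND (Sheffer stroke) gates, every one of which yields exactly 0 or 1:

```python
# Everything below is strict Boolean algebra: each gate outputs exactly 0 or 1.
def nand(a, b):               # the Sheffer stroke
    return 1 - (a & b)

def xor(a, b):                # XOR built from four NAND gates
    n = nand(a, b)
    return nand(nand(a, n), nand(b, n))

def half_adder(a, b):         # (sum, carry) of two one-bit inputs
    carry = nand(nand(a, b), nand(a, b))   # NOT(NAND) = AND
    return xor(a, b), carry

for a in (0, 1):
    for b in (0, 1):
        print(a, "+", b, "->", half_adder(a, b))
```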

The operation of a classical artificial neuron, the perceptron, is not that hard to simulate on a von Neumann machine. It is quite another matter when billions of such perceptrons must be represented digitally at once, and in close interaction with one another (source: Wikimedia Commons)

More profitable because building a genuinely well-trained universal network (not just one that tells cats from dogs in a photo and nothing more), let alone a multimodal one meeting today's quality standards, requires billions of neurons, so any gain in energy efficiency at the level of a single basic element translates into a huge advantage for the final system in the cost of the computations it performs. And, as we have already noted, the more humanity comes to rely on all kinds of machine-learning models (not only generative ones), the more acute the question becomes: can it afford to devote an ever-larger share of its global energy budget to running and training them?

⇡#SUDDENLY, SQUID

Compared directly, at the level of basic elements, the biological brain loses out to semiconductor computing systems: a signal crossing chemical synapses obviously travels more slowly than electrons move through the channels opened by transistor gates. Yet the advantage of a distributed neural-network structure, with its inherent capacity for learning, over integrated circuits rigidly ossified within Boolean algebra turns out to be impressive. Thus, researchers from the Massachusetts Institute of Technology recently demonstrated experimentally that the entire process of image recognition, from the moment a picture appears before the subject's eyes to its general categorization by the brain, takes as little as 13 milliseconds, whereas it was previously believed to require 100 ms or more. We are talking about assigning an image to some broad category, "landscape," "person," "quadruped," not about perceiving it in all its detail, of course; but it is impressive all the same. Impressive enough to force neuroscientists to reconsider decades-old ideas about how nervous tissue actually functions, which means developers of artificial neural networks have fresh food for thought as well.

The Loihi 2 neuromorphic chip, made according to Intel 4 technology standards (roughly corresponding to “7 nm” in TSMC terms), contains about 1 million artificial neurons – and still remains in the research project stage; its commercial release is not yet planned (source: Intel)

The classical description of a neuron's operation (in the visual cortex in particular) involves collecting long trains of impulses from the axons of "upstream" neighbors via the dendrites, aggregating those impulses in the cell body, forming an output signal and sending it down the axon. All of this, if you meticulously count every movement of ions (the charge carriers) across synaptic clefts, ought to take more than a hundred milliseconds. The experiments of the MIT researchers and their predecessors showed, however, that some neurons in the visual cortex cut the full "accumulate inputs, compute, emit output" cycle short and, literally upon receiving two or three nerve impulses, are already forming their output signal. The conclusion is that for the firing of a neuron (at least some of them), what matters is not only how many input impulses arrive but also the delays between them, and they matter so much that a general categorization of an image becomes possible literally "from two notes," or, more precisely, from two or even a single characteristic interval between pulses.
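
A toy model of that interval sensitivity, with made-up constants rather than parameters fitted to real cortical neurons: a leaky integrate-and-fire neuron fires on two closely spaced spikes but not on the same two spikes spread further apart.

```python
# Leaky integrate-and-fire toy: the membrane potential decays between input
# spikes, so only closely spaced spikes push it past the firing threshold.
# All constants are illustrative.
import math

TAU_MS, WEIGHT, THRESHOLD = 10.0, 0.6, 1.0

def fires(spike_times_ms):
    v, last_t = 0.0, None
    for t in spike_times_ms:
        if last_t is not None:
            v *= math.exp(-(t - last_t) / TAU_MS)   # leak since the last spike
        v += WEIGHT
        last_t = t
        if v >= THRESHOLD:
            return True
    return False

print(fires([0.0, 3.0]))    # short inter-spike interval -> fires (True)
print(fires([0.0, 30.0]))   # same two spikes, long interval -> no firing (False)
```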

Clearly, such abilities arise only through learning, and after the initial assignment of a picture to some group by the rough estimate of those "quick-firing" neurons, a more thorough verification follows once signals arrive from their more "deliberate" colleagues. But the fact remains: a biological neural network, by coarsening its operations even further (they were already a long way from the exacting standards of Boolean algebra set by semiconductor computing), manages to overcome its own objective limitations and deliver far more performance per watt of expended power than expected. Artificial neural networks have something to aspire to!

This is what the dendritic structure of just one biological neuron looks like (source: Harvard University/Google)

And aspire they do, relying, among other things, on an optoelectronic component base. Yes, from a manufacturing standpoint it is easier for developers to work with semiconductor structures prised one way or another out of the tight embrace of Boolean algebra: one can, after all, move from merely registering the fact of "signal present / signal absent" at the output of some circuit to accurately measuring the output current and making further decisions based on that value, for instance whether or not to pass the signal further down the chain (an implementation of the simplest artificial synapse). But transistors laid out on a plane and linked by metal buses in the interconnect layers are still extremely limited in their ability to form genuinely neuron-like structures.

For example: a study published in 2021 by Google and Harvard researchers found that the roughly 86 billion neurons of the studied brain share about 100 trillion synapses, meaning each neuron on average maintains on the order of a thousand contacts with its neighbors (and, notably, a sizeable proportion of the sample, up to 10%, consists of multisynaptic contacts between the dendrite of one neuron and an axon collateral of another, which complicates the structure of nerve-signal transmission still further). A semiconductor neuron-like circuit is physically incapable of contacting more than a dozen or so neighbors; beyond that the interconnect structure becomes prohibitively complex, even (indeed especially) if it is implemented in multilayer microcircuits. Nor should we forget that for multilayer chips, as we have already said, one of the most important constraints on performance growth is the difficulty of removing heat from the intermediate layers, because the per-operation energy efficiency of semiconductor VLSI (in the sense of the heat dissipated for each elementary Boolean operation, i.e. flipping a "1" to a "0" or back as a signal passes along some circuit) has not gone anywhere.

The basic design of a composite perceptron implemented on an integrated optoelectronic circuit: the signal enters the circuit from the emitter on the left and is split into four channels with different delays (each additional loop of waveguide adds a fixed standard delay); each channel then undergoes thermal phase modulation (which corresponds exactly to applying the input weights adjusted during training), and finally the modified signals are summed to form a single output (source: Università degli Studi di Trento)
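
A numerical sketch of that scheme, treating each channel as a complex field amplitude; the delays, phases and input pulse here are arbitrary illustrative values, not parameters of the Trento device:

```python
# Delay-and-phase perceptron model: one input split into four delayed copies,
# each phase-shifted (the trainable "weight"), then coherently summed.
import cmath

def photonic_perceptron(signal, delays, phases):
    # signal: list of complex field amplitudes sampled over discrete time steps
    out = []
    for t in range(len(signal)):
        acc = 0j
        for d, phi in zip(delays, phases):
            if t - d >= 0:
                acc += signal[t - d] * cmath.exp(1j * phi)   # thermal phase "weight"
        out.append(abs(acc) ** 2)        # a photodetector measures intensity
    return out

pulse = [1.0, 0.8, 0.3, 0.0, 0.0, 0.0, 0.0, 0.0]
print(photonic_perceptron(pulse,
                          delays=[0, 1, 2, 3],
                          phases=[0.0, 0.4, 0.9, 1.6]))
```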

A logical way around this limitation appears to be signal multiplexing, where a single communication channel carries several information streams at once. Applied to semiconductors, multiplexing is complicated by the need to reliably distinguish fine gradations of the transferred charge (already vanishingly small in absolute terms when it comes to the internal buses of integrated circuits) in electrical circuits. How hard this is can be judged by the extremely unhurried, years-long transition from single-level NAND memory cells (SLC) to triple- and quad-level ones (TLC and QLC), as well as by the rather harsh criticism of prospective penta-level cells (PLC) over their balance of benefits and drawbacks compared with QLC. And that is only data storage; with dynamically changing signals in electrical circuits the situation is even more complicated.
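
A quick sketch of why each extra bit per cell hurts: the number of distinguishable charge levels doubles, so the margin per level shrinks geometrically. The 3.0 V sensing window is an arbitrary illustrative figure:

```python
# Distinguishable levels per NAND cell type and the resulting margin per level.
WINDOW_V = 3.0   # assumed sensing window, for illustration only
for name, bits in (("SLC", 1), ("MLC", 2), ("TLC", 3), ("QLC", 4), ("PLC", 5)):
    levels = 2 ** bits
    print(f"{name}: {levels:2d} levels, ~{WINDOW_V / levels * 1000:.0f} mV per level")
```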

Multiplexing, however, works superbly in fiber-optic communication lines, which means it should be just as effective in optoelectronic systems. Besides building the perceptrons themselves (combining temporal and spatial multiplexing, for example), photonics can be integrated directly with quantum effects by coupling superconducting nodes, in which quantum interference is realized, to multiplexing waveguides. This may look like needless over-engineering: what, one might ask, is the logic of leaping straight to quantum neural networks, skipping the stage of "ordinary" ones with an orderly maturation of all the necessary technologies and processes? The point is that an "ordinary" computing circuit needs to hold a signal in certain parts of the circuit for a fairly long time, for instance while waiting for some parallel computation to finish. Photonics does this with optical resonators, in which a standing light wave is formed: rather bulky assemblies of reflective elements whose losses are considerable, which, alas, degrades the energy efficiency of the final computer. Moreover, each optical resonator is tuned to a single, fixed wavelength, which further complicates the design of optoelectronic circuits with frequency-multiplexed waveguides.

A hybrid superconducting circuit (photonics plus a quantum system) acting as an analogue of a biological synapse: the optoelectronic part forms the waveguides, while Josephson junctions (between the two large loops at the bottom right) perform single-photon detection (source: NIST)

To solve this problem, specialists at the US National Institute of Standards and Technology (NIST) have proposed a hybrid computing circuit that can be regarded as an analogue of a biological synapse, the interneuron connection whose behavior determines whether an impulse (and which one) is passed from the axon collateral of the preceding neuron in the notional signal chain to the dendrite of the next. The elegance of the group's idea is that they not only reduced the energy consumption of the resulting computing system but brought it down almost to the fundamentally achievable minimum, since the unit of information transfer here is exactly one photon (what, one wonders, could be more energy-efficient?). This became possible thanks to the Josephson effect, or more precisely to Josephson junctions built on its basis.

A superconducting circuit with a pair of such junctions forms a superconducting quantum interference device (SQUID), in which the current varies from zero (when the currents in its two superconducting loops flow in opposite directions) up to a certain maximum (when co-directed currents reinforce one another). The extreme sensitivity of a SQUID to the magnetic field in the region of its Josephson junctions makes it possible to register single photons, which carry no charge themselves but create an electromagnetic field as they propagate (more precisely, they are that field, its actual quanta).
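
The flux dependence behind that sensitivity, in the textbook approximation of a symmetric DC SQUID with negligible loop inductance (the junction critical current I0 below is an arbitrary illustrative value):

```python
# Critical current of an ideal symmetric DC SQUID vs applied magnetic flux:
# I_c(Phi) = 2 * I0 * |cos(pi * Phi / Phi0)|   (negligible loop inductance).
import math

PHI0 = 2.067833848e-15   # magnetic flux quantum, Wb
I0 = 1e-6                # single-junction critical current, A (illustrative)

def squid_critical_current(flux_wb):
    return 2 * I0 * abs(math.cos(math.pi * flux_wb / PHI0))

for frac in (0.0, 0.25, 0.5):
    print(f"flux = {frac} Phi0 -> I_c = {squid_critical_current(frac * PHI0) * 1e6:.2f} uA")
```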

The resulting hybrid system can resolve signals just 2 picoseconds long, and each photon that passes through the detector increases the current circulating in its superconducting loops by one conventional "unit." The result is a kind of analogue of a biological synapse; it remains only to hand the accumulated signal over to some perceptron external to the SQUID (exactly how that perceptron will be implemented is a separate question). The researchers report that the peak firing rate of their synapse exceeds 10 MHz, with an energy cost of about 3.3 × 10^-17 J per photon. For comparison, the maximum firing rate of neurons in the human brain is estimated at 340 Hz, and they spend roughly 10^-14 J on registering each synaptic event. In short, NIST has managed to create (not merely design, but implement in hardware) an analogue of a biological synapse that turns out to be several orders of magnitude more efficient than its prototype.
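
Put side by side, the figures quoted in this paragraph give the following rough ratios (a quick sketch using only those numbers):

```python
import math

# Ratios of the figures quoted above: NIST hybrid synapse vs biological synapse.
E_NIST_J, E_BIO_J = 3.3e-17, 1e-14      # energy per synaptic event
F_NIST_HZ, F_BIO_HZ = 10e6, 340         # peak firing rates

print(f"energy advantage: ~{E_BIO_J / E_NIST_J:.0f}x "
      f"(~{math.log10(E_BIO_J / E_NIST_J):.1f} orders of magnitude)")
print(f"speed advantage:  ~{F_NIST_HZ / F_BIO_HZ:.0f}x")
```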

Clearly, it is too early to celebrate another victory over inert matter and proclaim the dawn of ultra-economical neurocomputing: SQUID synapses with single-photon waveguides are not enough on their own; perceptrons not too far behind them in efficiency still have to be devised and implemented in hardware. But this direction in quantum photonics is already regarded as extremely promising, all the more so because the prize for reaching the goal is more than enviable even after accounting for the obviously large investments required: savings of many tens of terawatt-hours of electricity per year on a planetary scale.

Related materials

  • The vice president of Nvidia moved to the startup Lightmatter, which creates AI chips with silicon photonics.
  • Intel introduced the OCI photonic interconnect: 2 Tbit/s in both directions at a distance of 100 m.
  • Intel intends to address bottlenecks in AI systems by accelerating memory and network interfaces.
  • Intel introduced the Hala Point neuromorphic computer based on 1152 Loihi 2 chips with a brain-like architecture.
  • TSMC considers the development of silicon photonics important in the context of the boom in artificial intelligence systems.