In our previous materials on hardware neuromorphic computers, we briefly discussed the general principles of their operation and one of the most successful semiconductor implementations of such systems, Intel's Loihi and Loihi 2 chips. At the same time, we repeatedly emphasized that although hardware neuromorphics seems, in theory, an extraordinarily promising direction for the development of AI, especially in terms of the energy efficiency of the computations performed, in practice it is held back by a number of serious obstacles, and overcoming them will apparently demand far more effort and time from developers than enthusiasts of this trend initially imagined. Meanwhile, the neural networks familiar today, implemented entirely in software in the RAM of classical von Neumann computers, do not stand still either. And although they are very expensive to train and operate (taking into account, above all, the astronomical energy consumption of AI servers with Nvidia graphics adapters), the results they deliver are quite tangible and attractive. Neuromorphic computers largely remain experimental prototypes rather than workhorses for customers across the economy who need to solve AI problems, and their developers clearly need to do something about this.

⇡#For (long? good?) memory

Ordinary neural networks, referred to simply as artificial neural networks (ANN), can be classified more strictly as feed-forward neural networks (FFNN). The information signal in them moves in one direction only, from the input layer of perceptrons through the hidden layers to the output layer, forming no loops or return flows. As practice shows, this is quite sufficient for solving a huge class of discriminative problems: distinguishing cats from dogs in images, telling handwritten letters and digits apart, recognizing faces, and so on.

A signal passing through the neural layers leaves no information trace behind it; to adjust the weights at the inputs of the perceptrons (which is necessary during training, for example when the result produced by the FFNN does not match a given reference in supervised learning), special mechanisms have to be provided. For neural networks emulated entirely in computer memory this is no problem at all: each weight is a number, and changing the value of the desired variable in a table is trivial. From the point of view of hardware neuromorphics, however, organizing reliable, fast access to hundreds and thousands (or better yet, millions and billions) of notional rheostats that would set the weights at the inputs of the physical perceptrons tied to them is a deeply non-trivial task, and the difficulty of implementing it is precisely one of the strongest factors holding back the development of truly large hardware neural network computers.
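
To make the contrast concrete, here is a minimal sketch (Python with NumPy; the toy layer sizes and random weights are our assumptions, not any real system's) of a software-emulated FFNN: the signal moves strictly forward, and "turning a rheostat" amounts to a single array assignment:

```python
import numpy as np

def forward(x, weights, biases):
    """One pass through a feed-forward network: strictly input -> output, no loops."""
    a = x
    for W, b in zip(weights, biases):
        a = np.tanh(W @ a + b)  # each layer: weighted sum, then a nonlinearity
    return a

rng = np.random.default_rng(0)
# A toy 4 -> 8 -> 2 network; every "rheostat" is just an entry in these matrices
weights = [rng.normal(size=(8, 4)), rng.normal(size=(2, 8))]
biases = [np.zeros(8), np.zeros(2)]

y = forward(rng.normal(size=4), weights, biases)

# For an emulated network, adjusting a weight during training is one assignment:
weights[0][3, 1] -= 0.01
```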

One of the possible implementations of an RNN cell with a special gate that activates “forgetting” (erasing) information previously stored in local memory (source: Wikimedia Commons)

With recurrent neural networks (RNN), of which the spiking neural networks (SNN) we discussed earlier are a subtype, things are even more interesting. In an RNN, neurons exchange information among themselves rather than simply passing it strictly forward from layer to layer; in particular, they are able to consult data about their previous states while changing the current one under the influence of the next portion of incoming information. Essentially, RNNs are neural networks with internal memory (one example is the design of the Loihi chips, with SDRAM cells tied to individual artificial neurons), and they are therefore best suited to processing sequences of data: not a static picture whose contents need to be classified (cat/dog), but a chain of events extended in time; say, the sequence of notes in a generated musical composition, where, based on the notions of harmony "learned" by the neural network and the genre constraints set by the operator, the sounds already chosen at previous stages are supplemented with a new one that does not clash with them. Experts compare FFNNs to simple mathematical functions: here is a strictly specified sequence of operations (defined, in this case, by the weights at the inputs of the perceptrons), here are the input data, and in a single pass a well-defined answer is obtained, as if by formula. RNNs, by contrast, are more reminiscent of programmable computers: the formula for obtaining the answer is itself shaped by the incoming information. Strictly speaking, the presence of internal memory makes recurrent neural networks Turing complete, i.e. capable, given sufficient time, of solving essentially any computational problem.
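
A minimal sketch of this difference, under the same toy assumptions as above: the recurrent weight matrix feeds the previous state back in, so each new state carries a trace of the entire sequence seen so far:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_in, W_rec, b):
    """One RNN step: the new state depends on the input AND the previous state."""
    return np.tanh(W_in @ x_t + W_rec @ h_prev + b)

rng = np.random.default_rng(1)
W_in = rng.normal(size=(16, 8)) * 0.1    # input -> state
W_rec = rng.normal(size=(16, 16)) * 0.1  # state -> state: this is the "memory" loop
b = np.zeros(16)

h = np.zeros(16)                          # initial internal state
for x_t in rng.normal(size=(20, 8)):      # a sequence of 20 events, not a static picture
    h = rnn_step(x_t, h, W_in, W_rec, b)  # h accumulates a trace of everything seen so far
```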

This, by the way, is not exactly a compliment: "any computational problem" includes, in particular, obviously malicious ones, which immediately opens the widest scope for potential hacking of neuromorphic RNN computers. And this cannot be prevented, as they say, by design, since the very possibility of such hacking stems from inherent properties of the internal structure of these systems. The same story, in fact, applies to the most advanced natural neural network structure known to us, the human brain: countless ways of hacking this Turing-complete machine have been invented, from propaganda and fraudulent tricks to chemical activation of certain neural connections induced from outside the body. So if, in the foreseeable future, sophisticated neuromorphic computers do take the place of today's "smart" chatbots emulated in the memory of x86 servers, it will be possible to fool their heads (or whatever the hardware RNN enclosures come to be called) with far greater efficiency than today's jailbreaking of large language models achieves.

The LSTM cell supports sequential processing of the data arriving at an artificial neuron of a spiking network, while retaining information about its previous values until an explicit instruction to "forget" it is received (source: Wikimedia Commons)

Memory within an RNN can be organized in quite complex ways; worth mentioning here is long short-term memory (LSTM), capable of storing information not only about the previous state of the cell but also, in the general case, about several earlier ones. By controlling special forget gates, the system thus gains the ability to work with very long series of input data, spike impulses in the case of SNNs. Until the generative AI boom, neural networks with LSTM cells (implemented, of course, exclusively in software) were used mainly in machine translation systems, since they are good at preserving not only the context of the phrase currently being translated but also the syntactic and even stylistic features of the text as a whole (more precisely, of the corpus of texts on which training is performed). And even today, given the tendency of generative models to hallucinate, RNNs with LSTM are in no hurry to leave the stage. Tools for automated generation of program code, for dividing text into meaningful words, for auto-completion of various forms, and many other "smart" tools for which the power of universal FFNNs would be excessive still rely heavily on them. Moreover, that power, as has been noted more than once, does not come for free (above all in the energy sense), while RNNs, even emulated in the memory of x86 machines, remain more economical.
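
The standard LSTM step can be sketched as follows (the dimensions and initialization are illustrative); the forget gate f explicitly decides how much of the old cell memory c survives into the next step:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, P):
    """One LSTM step; c is the long-lived cell memory, gates decide what survives."""
    z = np.concatenate([x_t, h_prev])
    f = sigmoid(P["Wf"] @ z + P["bf"])        # forget gate: 0 = erase, 1 = keep
    i = sigmoid(P["Wi"] @ z + P["bi"])        # input gate: how much new info to write
    o = sigmoid(P["Wo"] @ z + P["bo"])        # output gate
    c_tilde = np.tanh(P["Wc"] @ z + P["bc"])  # candidate memory content
    c = f * c_prev + i * c_tilde              # old memory persists until f says "forget"
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(2)
n_in, n_h = 4, 8
P = {k: rng.normal(size=(n_h, n_in + n_h)) * 0.1 for k in ("Wf", "Wi", "Wo", "Wc")}
P.update({k: np.zeros(n_h) for k in ("bf", "bi", "bo", "bc")})

h, c = np.zeros(n_h), np.zeros(n_h)
for x_t in rng.normal(size=(10, n_in)):
    h, c = lstm_step(x_t, h, c, P)
```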

⇡#Pull your tongue

If RNNs are so good at analyzing the structure of sequences of almost arbitrary data (stock quotes, literary or technical text, musical compositions, a chart of seasonal changes in temperature and humidity at a given geographic location, etc.), why are generative models based on FFNNs, rather than on recurrent neural networks, so resoundingly popular today? Largely because RNNs in their original form are not oriented toward producing meaningful output: they cope brilliantly with identifying patterns, but not with distilling meanings. FFNNs (especially the most current ones, making active use of convolutions and transformers), although they hallucinate from time to time, do, thanks to their multilayer design, extract certain meanings from the array of training data, and that is precisely why they distinguish a cat from a dog in a picture quite confidently. True, the machine learning system itself is not aware of these meanings: circuits for reflection and introspection are absent from a multilayer neural network. But as a result of analyzing tens of thousands of images, the generative model undoubtedly captures the abstract ideas of a certain "catness" and "dogness" as objectively identifiable characteristics, at the level of vectors associated with the corresponding tokens in an essentially multidimensional space determined, in turn, by the weights at the inputs of its numerous perceptrons.
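
How "catness" and "dogness" end up as vectors can be illustrated with a deliberately toy example (the four hand-written coordinates below are our invention; real models learn hundreds of dimensions, and the coordinates are fitted during training, not written by hand):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: how close two directions are in the embedding space."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 4-D embeddings for illustration only
cat    = np.array([0.9, 0.1, 0.3, 0.0])
dog    = np.array([0.8, 0.2, 0.4, 0.1])
teapot = np.array([0.0, 0.9, 0.0, 0.7])

print(cosine(cat, dog), cosine(cat, teapot))  # ~0.98 vs ~0.08: "catness" sits near "dogness"
```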

Schematic diagram of the operation of SpikeGPT, a language model built on the receptance weighted key value (RWKV) principle (source: University of California, Santa Cruz)

An RNN, on the other hand, is oriented rather toward isolating a temporally extended structure in the data sequence offered to it, be it the bars of a musical composition, the words in a sentence, or the operators in a fragment of program code: the structure, that is, not the content. This is precisely why recurrent neural networks are so good at "nonsense generation": fed an array of source data (Shakespeare's plays, articles from an online encyclopedia complete with XML markup, essays on algebraic geometry in LaTeX format, i.e. with mathematically correct formulas and diagrams, etc.), an RNN generates output that at first glance is indistinguishable from the original in grammar, syntax, even style, but most often contains no meaning whatsoever. Such phrases, assembled from dictionary words according to all the rules of grammar yet devoid of any substantive content (represented by the example known to everyone who has studied the philosophy of science, "the moon multiplies quadrangularly"), Bertrand Russell classified as "nonsense of the second type." Outwardly they differ little from the hallucinations of generative models, but at a deeper level the difference is significant. It is one thing when a vector fails: it was supposed to point to a well-defined region of the latent space (established by the model's prior training) but for some reason missed. It is quite another when no "extraction of meanings" was ever involved in the first place, and elements that are (for the given neural network) in no way related to one another are combined according to formal rules that the same network has mastered well.

However, at the current stage of development, RNNs are catching up with transformer-based FFNNs in terms of "extracting meanings", up to the appearance (so far as an experimental prototype, inaccessible to the general public but fully functional) of the generative model SpikeGPT, which operates on receptance weighted key value (RWKV) blocks. RWKV blocks open up the possibility of accelerated training for a spiking neural network (thanks to parallelization across threads), whereas previously the significantly longer training times reduced the practical value of even purely software implementations of modern RNNs compared with FFNNs. Moreover, one of the main advantages of a recurrent neural network is the linear (rather than quadratic, as with transformers) dependence of computational complexity on scale, so RNN models with a comparable number of parameters should certainly be more energy efficient than today's common generative systems built on the transformer architecture. Thus, the aforementioned SpikeGPT in versions with 45 million and 216 million parameters performed, according to its creators, twenty times fewer computational operations than rivals of comparable complexity based on transformers, while demonstrating comparable results in a number of benchmarks significant for assessing the capabilities of machine learning systems.
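
The complexity difference is easy to see side by side. The sketch below contrasts full self-attention, which builds a T x T matrix, with a schematic RWKV-flavoured recurrence that keeps only a running state; the scalar decay and the "receptance-style" weight here are simplifying assumptions of ours, not the published RWKV formulas:

```python
import numpy as np

T, d = 512, 64
rng = np.random.default_rng(3)
Qm, K, V = (rng.normal(size=(T, d)) for _ in range(3))

# Transformer self-attention: every position attends to every other -> O(T^2) time and memory
scores = Qm @ K.T / np.sqrt(d)             # the quadratic cost lives in this (T, T) matrix
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)
out_attn = attn @ V                        # (T, d)

# RWKV-flavoured recurrence (schematic): one O(1) state update per step -> O(T) overall
decay = 0.95                               # assumed scalar decay; real RWKV learns per-channel decays
num, den = np.zeros(d), 1e-9
for t in range(T):
    w = np.exp(K[t].mean())                # toy "receptance-style" weight, not the published formula
    num = decay * num + w * V[t]
    den = decay * den + w
    out_t = num / den                      # weighted average over the past, no T x T matrix anywhere
```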

The general principle of operation of the self-attention layer in the transformer architecture (source: The Illustrated Transformer)

According to Mike Davies, director of Intel's Neuromorphic Computing Lab, scaling spiking neural networks up to the level of the world's dominant large language models (FFNN-based, with transformers) will remain a major challenge until an efficient hardware basis for RNNs is proposed; speaking specifically of semiconductor neuromorphic computers, he put it this way: "This is going to be a really exciting path forward in this domain, and while we're not there yet—we will need a silicon iteration to support it". And it is clear why: emulating complex neural networks in the memory of von Neumann computers is prohibitively expensive precisely because of the defining architectural feature of these computing systems, namely the physical separation of the data store (memory; in particular, high-speed RAM) from the node that actually performs the calculations (the processor). The more complex a neural network, the more data must be shuttled between RAM and the CPU to keep it running, and limited bandwidth becomes one of the key barriers to scaling such a system. The Loihi chips we have already reviewed, and a number of similar semiconductor implementations of neuromorphic computers, are designed precisely to realize the advantages of RNNs with well-established semiconductor manufacturing technologies by placing memory cells close to primitive but high-speed processing nodes.
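
A back-of-envelope estimate shows the scale of the problem (all numbers below are illustrative assumptions of ours, not measurements of any specific system):

```python
# Weight traffic for one forward pass of an emulated network, order-of-magnitude only.
params = 7e9                # assumed parameter count of a mid-size modern model
bytes_per_weight = 2        # fp16 storage
traffic = params * bytes_per_weight           # every weight crosses the memory bus at least once
bandwidth = 3.35e12         # ~3.35 TB/s, an HBM3-class accelerator memory figure
print(traffic / bandwidth)  # ~4.2 ms per pass spent just moving weights, before any math is done
```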

⇡#Problems and solutions

One important reason, among others, why those same Loihi 2 chips have not yet displaced Nvidia server adapters, devouring watts by the many hundreds apiece, from data centers around the world is the difficulty of training recurrent neural networks. Creating a hardware basis for their energy-efficient operation is not enough: if the resulting smart machine starts generating answers with a noticeably higher share of errors than, say, GPT-4o or its analogues, the mere fact of a significant reduction in the device's energy appetite is unlikely to console its users. The problem is that the most common method of training FFNNs (perfectly suitable for models enhanced with transformers), backpropagation, cannot be applied directly to RNNs. Since the cells of a recurrent neural network store information about previous states in one way or another, simply changing the weights at the inputs is not enough; one must also influence the "memory" of their past values, i.e. apply backpropagation through time (BPTT).
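
In miniature, BPTT looks like this (a sketch for a scalar RNN with a single weight): the forward trajectory has to be stored, and the chain rule is then applied step by step backward through time:

```python
import numpy as np

# Toy RNN h_t = tanh(w * h_{t-1} + x_t), with a loss on the final state only.
# To adjust w, the gradient must flow back through EVERY past state, not just the last one.
w, xs = 0.7, [0.5, -0.1, 0.3, 0.9]

# Forward pass, storing the whole trajectory (this is the memory cost of BPTT)
hs = [0.0]
for x in xs:
    hs.append(np.tanh(w * hs[-1] + x))

loss_grad = 2 * (hs[-1] - 1.0)              # d(loss)/d(h_T) for loss = (h_T - target)^2

# Backward pass through time: one chain-rule step per time step, accumulated into dw
dh, dw = loss_grad, 0.0
for t in reversed(range(len(xs))):
    pre = w * hs[t] + xs[t]
    dpre = dh * (1 - np.tanh(pre) ** 2)     # back through the tanh
    dw += dpre * hs[t]                      # this time step's contribution to the weight gradient
    dh = dpre * w                           # continue back to the previous state
```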

Each RNN block, changing its state under the influence of the next impulse, retains memory of the past – and when training with backpropagation, this must be taken into account (source: Stanford.edu)

But here, too, not everything is simple, at the fundamental, mathematical level. The main working tool for tuning neural networks, the gradient, is the vector of partial derivatives of the loss function with respect to all adjustable weights: it is this vector that indicates the direction of steepest growth of the loss function over the entire set of weights at once. The partial derivatives reflect differences, often extremely large ones, between the values of individual weights; this is typical, by the way, not only of RNNs but also of multilayer FFNNs. As a result, unpleasant situations regularly arise in which the gradient explodes (exploding gradient) or, conversely, vanishes (vanishing gradient): its value either overflows the data type in which it is stored or shrinks to a negligible quantity, and in the latter case the error, as is easy to see, no longer propagates, i.e. learning effectively stops. In the case of SNNs this is countered, in particular, by organizing training through spike-timing-dependent plasticity (STDP), which, incidentally, is also characteristic of biological neural networks; but implementing STDP, unlike BPTT, requires far more sophisticated algorithms, whose development is in itself akin to an art and relies heavily on the features of the hardware implementation of the particular neuromorphic system. Here we are already beginning to talk about meta-learning, that is, about training recurrent neural networks not to solve a specific problem, but, more broadly, in how they can themselves learn to solve such problems.
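
A pair-based STDP update rule can be sketched as follows (the amplitudes and time constant are illustrative values of the kind used in the literature, not parameters of any specific chip):

```python
import numpy as np

def stdp_dw(t_pre, t_post, a_plus=0.01, a_minus=0.012, tau=20.0):
    """Weight change from one pre/post spike pair (times in ms).
    Pre fires before post -> strengthen the synapse; post before pre -> weaken it."""
    dt = t_post - t_pre
    if dt > 0:
        return a_plus * np.exp(-dt / tau)    # causal pair: potentiation
    return -a_minus * np.exp(dt / tau)       # anti-causal pair: depression

w = 0.5
for t_pre, t_post in [(10.0, 14.0), (30.0, 26.0), (50.0, 51.0)]:
    w = float(np.clip(w + stdp_dw(t_pre, t_post), 0.0, 1.0))  # keep the weight bounded
```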

The list of challenges facing developers of neuromorphic systems is truly enormous, and note that we have not yet even begun to consider the possible options for their hardware implementation, apart from the semiconductor one illustrated by Loihi in the previous article of this series. It is enough to point out just a few of the most significant:

  • Technical complexity: a huge number of interconnected components which (unless we are talking about well-established semiconductor production, and neuromorphic semiconductor computers have plenty of limitations of their own) are today extremely difficult to manufacture at scale with the required level of quality;
  • Difficulties in integrating neuromorphic systems with conventional semiconductor computers: optical neuroprocessors, for example, require the development of nodes to interface them with electrical circuits, and photonics, as we have already noted, is hardly the simplest engineering discipline in itself;
  • Scaling of neuromorphic systems is complicated by a multiple increase in technical complexity (compared with the original prototype) and a corresponding increase in the probability of failures;
  • Their training, as already noted, is a considerable challenge in itself, owing to the high frequency of exploding/vanishing gradients, while the need to prepare specialized training data sets and to create adequate performance benchmarks further blurs the notion of "training quality";
  • The very position of neuromorphics as a branch of engineering at the intersection of biology (in terms of studying natural prototypes), cybernetics, microelectronics, and many other disciplines places almost prohibitive demands on anyone who wants to try their hand at it, and the shortage of qualified personnel does nothing to accelerate progress in this direction;
  • Hardware neuromorphic systems, since almost none of them rely on well-established mass production processes, are, to put it mildly, extremely prone to imperfection: their various nodes (even nominally identical ones!) can differ noticeably in performance, reliability, and other characteristics, which directly affects the quality of operation of the finished neuromorphic system as a whole.

Assembly of 16 semiconductor neuromorphic chips created as part of the DARPA SyNAPSE (Systems of Neuromorphic Adaptive Plastic Scalable Electronics) program: 28 nm process technology, 1 million artificial neurons and 256 million synapses in each chip (source: Wikimedia Commons)

Nevertheless, researchers are not backing down: the potential benefits of bringing neuromorphic systems into regular use are too great, even if they end up serving a limited range of tasks rather than completely replacing transformer-based FFNN networks. The previously mentioned Mike Davies, in an interview with EE Times, gave the following example: by the energy-delay metric (which jointly accounts for the energy spent performing a given task and the latency of executing it on a computing circuit), hardware neuromorphic systems can outperform generative neural networks executed in the memory of von Neumann machines by three decimal orders of magnitude. Among the problems that can be solved almost exclusively by neuromorphic computers, and in a more distant future by quantum ones (simply because conventional hardware is too inefficient for them), is quadratic unconstrained binary optimization (QUBO), with its extremely wide range of applications, as well as real-time predictive control of robotics. Perhaps, without neuromorphic computers put into series production, we really will not see truly smart robots operating adequately in the changing environment of the real world. But what the material basis for their neuromorphic "brains" may turn out to be, we will examine in the next installment of this series: there are already quite a few options on offer today.
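
For reference, a QUBO instance is simply the minimization of a quadratic form over binary vectors. The brute-force toy below (the matrix is our own illustrative example) shows the formulation, and also why plain enumeration stops scaling: the search space doubles with every added variable, which is exactly where specialized hardware is expected to shine:

```python
import itertools
import numpy as np

# QUBO: minimize x^T Q x over binary vectors x. A toy 4-variable instance.
Q = np.array([[-1.0,  2.0,  0.0,  0.0],
              [ 0.0, -1.0,  2.0,  0.0],
              [ 0.0,  0.0, -1.0,  2.0],
              [ 0.0,  0.0,  0.0, -1.0]])

best_x, best_e = None, float("inf")
for bits in itertools.product([0, 1], repeat=4):  # 2^n candidates: hopeless for large n
    x = np.array(bits, dtype=float)
    e = float(x @ Q @ x)
    if e < best_e:
        best_x, best_e = x, e

print(best_x, best_e)  # an alternating assignment such as [1 0 1 0] minimizes this Q
```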

