Goal-oriented AI: sees both the goal and the obstacles

Whether there is a purpose to life in general, and to intelligent life in particular, is not so much a philosophical question as, perhaps, a religious one. At the same time, any living being continuously sets itself goals at a tactical level, even if not always fully consciously: to enroll in this particular university, to catch up with this particular antelope, to attach this particular sea anemone to its shell. Formally, tactical goal-setting may seem trivial: the university is chosen because it is closer to home, the antelope because it is the weakest (the one noticed lagging behind the others once the chase began), the sea anemone because it was the first to come along. But the path to even the most seemingly impregnable, distant peak is in fact a chain of interconnected tactical goals achieved one after another, so teaching artificial intelligence to perform a simple action that is, if not “conscious,” then at least “tactically justified,” is in some sense a step toward the much-discussed “strong AI”: one that, we would like to hope, will set tasks for itself in the human sense of the word and then find ways to solve them. It is not at all certain that this will not ultimately come back to haunt humanity (the science fiction written on this subject is beyond counting), but for now strong AI remains an elusive goal.

Researchers studying human behavior established long ago that an activity with a clearly set goal, even something as banal as walking toward a clearly visible landmark, is much more effective than simply striding in the indicated direction. More effective in a purely physical sense: experiment participants who focused on reaching the designated waypoint, never letting it out of their sight, moved faster on average and tired less than the control group, who were simply told to “walk until you are stopped.” Just as with a biological being, it does not always make sense for a generative model to set “conscious” goals for itself, much as an adult does not do so when bringing a spoon to his mouth at dinner (whereas a child just learning to eat independently, on the contrary, actively sets goals during each such action). But goal-driven AI has its own important field of application, and such systems have been developing especially actively in recent years.

One of the hermit crab’s instinctive goals, critical to its very existence, is to find a suitable shelter. If no shell of some unlucky mollusk happens to be within claw’s reach, a modern hermit crab has to improvise. But if an AI model behaved in a similar way, wouldn’t its operators decide it was hallucinating, or even cheating? (Source: Wikimedia Commons)

One seventh

Experts from the American Project Management Institute (PMI) identify just seven models of AI application, to which (or rather, to various combinations of which) hundreds of thousands, if not millions, of practical implementations of this newfangled technology can be reduced. These are:

  • Hyper-personalization (when, say, for literally every client a smart system formulates an optimal, individually tailored offer based on his personal profile: a purchase history, a medical record, a chronicle of participation in exchange trading, and so on),
  • Predictive analytics and support for management decisions (here, based on the analysis of long series of past events, machine learning systems make reasoned forecasts about how the situation will develop further),
  • Identifying patterns and anomalies (similar to the previous model, except that what is studied is not time series but disparate data on a given topic, from which the system extracts patterns and cause-and-effect relationships, and possibly statistically significant violations of them that people have not yet noticed; the latter, in turn, may point either to gaps in the data set or to the presence of some unaccounted-for factor),
  • Interacting with people in natural languages (hello, chatbots!),
  • Recognition (of faces in video, of coherent speech in an audio stream, of letters and words in a handwritten prescription, and so on),
  • Autonomous systems (meaning not only self-driving trucks or, say, warehouse forklifts, but also software bots that perform a certain range of tasks independently),
  • And, finally, goal-oriented, or goal-driven, AI: a software agent capable of finding, by trial and error, the optimal way to solve the task set before it. The fundamental difference from the previous model lies precisely in the permissibility of mistakes: after all, no one will look on with affection as a self-driving car pulls into an intersection on red time after time, only to receive the operator’s remark, “Wrong, let’s try again.”

An important point: goal-driven AI cannot, in principle, be a neural network with fixed weights, trained once and then merely applied (in inference mode, at much lower resource cost): it is always dynamic and ready for continuous further training. The AlphaZero machine learning system, created by Google DeepMind to achieve superiority over human players in classical chess, shogi (its Japanese relative) and go, is a successful example of a goal-driven agent that achieved indisputable success long before the era of ubiquitous generative AI. Such an agent is fundamentally different from earlier approaches to computer chess, such as IBM’s once-famous Deep Blue, which relied on a vast database covering essentially all possible piece movements for an enormous number of positions and, using heuristic methods distilled from the experience of human grandmasters, tried each time to find the optimal move in that boundless sea of possibilities.

AI systems implemented in practice do not necessarily belong exclusively to one of the listed models, but often combine two or more of them, just as human temperaments rarely manifest themselves strictly in one of the four classic variants of choleric, phlegmatic, melancholic or sanguine (source: PMI)

AlphaZero works differently: it is a deep neural network initially taught not complex grandmaster techniques but only the most basic rules for making moves. After that, AlphaZero simply began to play against itself (first in classical chess), alternately taking the black and the white side at the same board, honestly trying to win (essentially, against itself) without deviating from the rules it had learned, and repeating this several million times. This process is widely known as reinforcement learning.

What is important here is that reinforcement learning is conducted differently from the other popular types of machine learning: supervised and unsupervised learning. The key to the success of supervised learning is a large array of well-annotated data, most often labeled by humans, against which the system regularly checks itself during training. Unsupervised learning, in turn, means working with unlabeled information in which the system must independently find hidden patterns and relationships. Reinforcement learning proceeds, in essence, the same way biological neural networks are trained in nature: by trial and error, with rewards and penalties for each decision made. There is, of course, a risk that instead of looking for unusual ways to win honestly, the system will start cheating, rewriting the rules in its favor or finding a way around them; current reinforcement-trained models such as OpenAI o1-preview and DeepSeek R1 have already been caught doing exactly that. Then again, are there no cheaters among carriers of biological intelligence?
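To make the contrast with supervised learning concrete, here is a minimal, purely illustrative sketch in Python. The three-button “environment” and all the numbers are invented for this example and do not describe any real system: the point is only that the agent never sees a “correct answer,” just a numerical reward or penalty, and gradually builds up its own estimate of which action is worth taking.

```python
import random

# A toy three-button environment: one hidden button pays off, the other two do not.
class ToyEnvironment:
    def __init__(self):
        self.lucky_action = random.randrange(3)

    def step(self, action):
        # The only feedback the agent ever receives is a number: +1 or -1.
        return 1.0 if action == self.lucky_action else -1.0

env = ToyEnvironment()
values = [0.0, 0.0, 0.0]      # the agent's running estimate of each action's worth
learning_rate = 0.1

for _ in range(300):
    action = random.randrange(3)          # trial...
    reward = env.step(action)             # ...and error
    # Nudge the estimate toward the reward actually received; no labels involved.
    values[action] += learning_rate * (reward - values[action])

best = max(range(3), key=lambda a: values[a])
print(f"learned values: {[round(v, 2) for v in values]}, best action: {best}")
```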

At first, AlphaZero moved the pieces around the board almost chaotically (while observing all the basic rules, of course). But quite quickly, thanks to a well-implemented reward function (which still has to be defined and specified in advance), it stopped committing blunders, falling for the scholar’s mate and making other banal errors. The weights at the inputs of its perceptrons, adjusted under the influence of the reward function, raised the probability of less trivial moves (for a human player one would say “more thoughtful” ones): moves that, in a given position and for the side being played (black or white), were more likely to pay off in the long run. The computer on which AlphaZero underwent its reinforcement training was quite powerful, and the number of possible moves in any position is limited anyway, so it is not surprising that from a certain point the goal-driven agent began to play better than almost any grandmaster, without having the slightest idea of openings, endgames or any other theoretical background of the game. Interestingly, training this neural network to a level its developers considered acceptable took about 9 hours for chess, 12 hours for shogi, and 13 days for go (where, to be fair, the board is much larger).

A visual illustration of how AlphaZero gradually matched or surpassed the performance of game-specific machine learning systems through reinforcement learning (source: DeepMind)

The DeepMind team quotes Yoshiharu Habu, a 9-dan player and the only shogi player in history to have held all seven major titles, who said admiringly: “Some of AlphaZero’s moves, such as moving the king (王将, ōshō) to the center of the board, contradict theory and, from a human perspective, put the computer player in a dangerous position. But at the same time the computer, surprisingly, retains control of the board, and this unique style opens up new opportunities for human players.” Meanwhile, where machines specialized in chess (such as the same Deep Blue) sift through tens of millions of possible moves at each step in search of the best one, AlphaZero limits itself to mere tens of thousands. Yes, the human mind is more economical still: a grandmaster, relying on the achievements of theory and his own experience crystallized into heuristics, runs through “only” hundreds of candidate moves in his head. But grandmasters sometimes lose to the goal-driven AI, which, unconstrained by memorized opening books (precisely because it has never memorized any), brings fresh ideas and non-trivial moves to the game, often baffling biological players.

According to merit

Terms borrowed directly from behavioral psychology, such as “reinforcement,” “reward” and “penalty,” can mislead a non-specialist: after all, a software goal-driven AI agent is not even a robot; it cannot be rewarded with a can of WD-40 or given an electric shock for a mistake. For the AI, both rewards and penalties are simply numbers, positive and negative respectively. It is this numerical reinforcement, expressing an assessment of the system’s actions in a given situation, that replaces, say, the template of “correct answers” against which a generative model is checked during supervised training. Composing an effective reinforcement scheme for a given model is a real art, since there can be quite a few kinds of reinforcement, depending on the task the operator sets. For example (a sketch after this list shows how several of these schemes might be combined in code):

  • A simple fixed reward – for the successful completion of a routine procedure (for example, a game character controlled by the goal-driven AI successfully jumps from one moving platform to another: +1 point),
  • A simple fixed penalty – for failing a similarly routine task (a robot operating in a simulated virtual environment drops the box it has taken off a shelf: -5 points),
  • Reinforcement that is large in absolute value but rarely issued (the agent finds the way out of a particularly difficult labyrinth: +100; it misses the landing strip and crashes the plane in a simulator: -500),
  • Time-based reinforcement (another simple maze completed 1 ms faster than the record: +1; a package delivered to the final node of a complex network 1 minute slower than before while solving the traveling-salesman problem: -1),
  • A cumulative reward, which ends up larger the longer the system keeps performing its task correctly (a robot climbing a ladder gets +1 point for every step it climbs without losing its balance, and if it stumbles the accumulated prize is reset to zero; the agent controlling it will then strive to maximize the cumulative reward rather than concentrate on perfecting each individual step),
  • Rewards internal to the learning environment (a model that collects certain objects in a game gets +0.05 points for each one; this stimulates the accumulation of resources on the one hand, while on the other keeping this side quest from taking priority over the main task, in this case getting through the labyrinth in which the objects are scattered),
  • Rewards external to that same environment (most often granted by the operator observing the reinforcement learning; they can be compared to points “for artistry” in figure skating: the robot simply moved an object from one place to another – no external reinforcement; it did so in some particularly graceful way, from a human point of view – an immediate +50 points),
  • Task-specific reinforcements (these depend entirely on the task set before the goal-driven AI and can vary widely in how strongly they affect the system; for example, if during training in virtual reality a robot accidentally discovers a software bug that instantly places it at the exit of the labyrinth, it would be logical for a system designed to solve logic problems to receive a large penalty for exploiting it, whereas a system aimed at finding holes in game code should, on the contrary, receive an equally significant reward).
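As an illustration only, the sketch below combines several of the schemes above into a single hypothetical reward function for a simulated maze-running robot. Every event name and point value is invented for this example and does not come from the article’s sources.

```python
def maze_reward(event, steps_without_fall=0, seconds_vs_record=0.0):
    """Return the numerical reinforcement for a single event in the simulation."""
    reward = 0.0
    if event == "step_ok":                  # simple fixed reward for a routine success
        reward += 1.0 + steps_without_fall  # plus a cumulative bonus for a long clean run
    elif event == "dropped_box":            # simple fixed penalty for a routine failure
        reward -= 5.0
    elif event == "item_collected":         # small reward internal to the environment
        reward += 0.05
    elif event == "maze_solved":            # large but rarely issued reinforcement...
        reward += 100.0
        reward -= seconds_vs_record         # ...adjusted by time relative to the record
    elif event == "exploited_bug":          # task-specific: a penalty for a logic solver,
        reward -= 50.0                      # but it would be a bonus for a bug-hunting agent
    return reward

# Example: the robot reaches the exit 2.5 seconds slower than the record.
print(maze_reward("maze_solved", seconds_vs_record=2.5))   # prints 97.5
```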

An example of a Markov decision process with three states and two possible actions in each, where the system is given rewards based on the results of individual actions: in this case, -1 point for one and +5 for the other (source: Wikimedia Commons)
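Such a structure can be written down directly as a small transition table. The sketch below is one hypothetical way to encode a three-state, two-action process of this kind in Python; only the -1 and +5 rewards follow the caption, while the state names and transition probabilities are made up for the example.

```python
import random

# transitions[state][action] = list of (probability, next_state, reward)
transitions = {
    "s0": {"a0": [(1.0, "s1", 0.0)],
           "a1": [(1.0, "s2", 0.0)]},
    "s1": {"a0": [(0.7, "s0", 5.0), (0.3, "s1", 0.0)],
           "a1": [(1.0, "s2", 0.0)]},
    "s2": {"a0": [(1.0, "s0", 0.0)],
           "a1": [(0.6, "s0", -1.0), (0.4, "s2", 0.0)]},
}

def step(state, action):
    """Sample the next state and reward, as an environment would during training."""
    r, cumulative = random.random(), 0.0
    for prob, next_state, reward in transitions[state][action]:
        cumulative += prob
        if r <= cumulative:
            return next_state, reward
    return next_state, reward   # numerical safety fallback: last listed outcome

print(step("s1", "a0"))   # e.g. ('s0', 5.0)
```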

As these examples show, the environment in which reinforcement learning takes place is as important for developing the right strategy of action as the goal-driven AI itself. This is why such systems are most often classed as agents rather than universal generative models: changing the conditions under which rewards and incentives are granted can significantly change the way the artificial intelligence acts. AlphaZero, discussed a little earlier, was able to play three different games with equal success precisely because the boards and rules of all three differ significantly: it is far from certain that one could train a model showing equally high results simultaneously in classic checkers and, say, giveaway checkers, where the goal is the opposite: to lose one’s pieces. It is also important to choose the right ratio between different kinds of incentives: if the robot does not hit a dead end after another turn in the labyrinth, it deserves a small reward, while successfully exiting the labyrinth deserves a large one.

Surprising as it may seem, this simple set of rules for reward and punishment really does allow the AI to form successful strategies for solving particular problems in particular environments, just as basic reactions to external stimuli help the simplest organisms, which have no nerve cells at all, survive, reproduce and evolve. At the same time, because the model retains the ability to change its weights the whole time it is running, rather than using a weight configuration memorized once and for all during training, such an AI is able, unlike a number of other machine learning systems, to resolve the “exploitation-exploration dilemma.” Its essence is that, having once discovered a successful way to solve a problem, an intelligent agent (not necessarily a computer one; this happens to people all the time) tends to stop looking for anything better and begins to use (exploit) that same method over and over again. Meanwhile the environment around it may be changing dynamically, and its own abilities, needs and resources may be evolving, but no matter: “if it ain’t broke, don’t fix it.” And when something important does break, it is usually too late.

A general framework for adjusting the policy of a goal-oriented AI agent operating in the exploitation-exploration paradigm (source: Medium)

So it is not that difficult to “program” a goal-driven AI for regular, if limited in depth, exploration of an environment it would seem to have already studied during self-training, or rather, to set it, among other goals, the task of conducting such exploration from time to time: to leave its comfort zone, in the language of psychologists, in order to look for new possibilities (which may well fail to appear, which is why exploration should not have the highest priority, i.e. the maximum reinforcement, unless we are talking about a specialized explorer agent), to test them, and, if they prove suitable, to rebuild its established pattern of actions to reach a new optimum.
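One of the simplest standard ways to build exactly this kind of occasional, deliberately low-priority exploration into an agent is the classic epsilon-greedy rule: with a small probability the agent tries a random action instead of the one it currently considers best. The sketch below is a generic illustration with invented numbers, not code from any system mentioned here.

```python
import random

def epsilon_greedy(action_values, epsilon=0.05):
    """Pick the best-known action most of the time, a random one occasionally."""
    if random.random() < epsilon:                      # leave the comfort zone...
        return random.randrange(len(action_values))    # ...but only rarely
    return action_values.index(max(action_values))     # otherwise exploit what already works

# Example: the agent currently rates its three available actions at 0.2, 1.4 and 0.9 points.
choices = [epsilon_greedy([0.2, 1.4, 0.9]) for _ in range(1000)]
print("share of non-best choices:", 1 - choices.count(1) / len(choices))
```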

By the way, a well-chosen environment for the AI agent’s work can in itself create the preconditions for the right balance between exploitation and exploration. It was not for nothing that AlphaZero’s developers made it play against itself: had its opponent been a specific person, or a specialized system like Deep Blue not built on the principle of a generative neural network, the most it could have achieved would have been its opponent’s level. There would simply be no reason to leave the comfort zone in which it already, roughly speaking, wins more than half of its games. Pitted against itself, an AI agent set on winning (which in effect means overcoming itself; the Nietzscheans applaud silently) kept growing and developing with every game, ultimately pushing the boundaries of its comfort zone out to the limits physically achievable on its hardware.

A particularly successful approach to reinforcement learning has proved to be the model-free Q-learning method proposed by Chris Watkins back in 1989. Its “model-free” nature lies precisely in the system’s initial lack of any “ideas” about the structure of the environment in which it operates; there is only the set of rules needed to act in that environment. A person looking at a drawing of a labyrinth sees the whole picture at once and immediately discards many obviously unsuitable routes, whereas a robot placed at the starting point has no idea where the exit is. The AI controlling it simply has instructions (a table listing all possible reinforcements for all possible actions): a step in any available direction carries no penalty, running into a wall costs a point, and a large reward is due for reaching an area open on all four sides (the exit itself). This alone is enough for the agent to begin learning the optimal course of action simply by receiving feedback from its environment, in what is called asynchronous dynamic programming. Q-learning is thus a reinforcement-learning method that finds the optimal action policy for any finite Markov decision process.
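To give a concrete feel for the method, here is a minimal tabular Q-learning sketch on a toy one-dimensional “maze.” The corridor layout, the point values and the hyperparameters are all assumptions made for the illustration, not taken from Watkins’ work; the core of the method is the single update line in the loop, which shifts each table entry toward the received reward plus the discounted value of the best action in the next state.

```python
import random

N_CELLS = 6                 # corridor cells 0..5; cell 5 is the exit
ACTIONS = (-1, +1)          # step left or step right
alpha, gamma, epsilon = 0.5, 0.9, 0.1

Q = [[0.0, 0.0] for _ in range(N_CELLS)]   # Q-table: one row per cell, one value per action

def step(cell, action_idx):
    """Environment feedback: the next cell and the numerical reinforcement."""
    target = cell + ACTIONS[action_idx]
    if target < 0:                     # ran into the left wall
        return cell, -1.0
    if target == N_CELLS - 1:          # reached the exit
        return target, 100.0
    return target, 0.0                 # an ordinary step carries no penalty

for _ in range(500):                   # training episodes
    cell = 0
    while cell != N_CELLS - 1:
        if random.random() < epsilon:                  # occasional exploration
            a = random.randrange(2)
        else:                                          # otherwise act greedily
            a = Q[cell].index(max(Q[cell]))
        next_cell, reward = step(cell, a)
        # The classic Q-learning update:
        # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        Q[cell][a] += alpha * (reward + gamma * max(Q[next_cell]) - Q[cell][a])
        cell = next_cell

# After training, the greedy policy in every non-terminal cell should be "go right" (index 1).
print([row.index(max(row)) for row in Q[:-1]])
```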

As a result, an AI system learning with reinforcement is focused on maximizing the total reward that can in principle be obtained while performing a given task, taking into account such subtleties as penalties for excessive slowness, the relative sizes of the rewards for avoiding obvious blunders (a few points) versus a crushing victory (many at once), and so on. To an outside observer, the result of the (self-)training of such a goal-driven agent may look almost like evidence of genuine intelligence in a computational model running in the memory of a large computer, much as the first naturalists once refused to recognize the complex instincts of animals (the self-organization of a beehive, the annual migration of birds, and so on) as the result of gradual adaptation to the natural environment rather than the direct intervention of an omniscient Creator. But no: today’s goal-driven agents are still very far from strong AI, roughly as far as single-celled organisms self-organizing into a colony are from qualitatively more complex multicellular life. Biological evolution did make that leap, however, which means artificial intelligence systems have a chance too, albeit in the distant (for now?) future.

Related materials

  • DeepMind researchers have proposed distributed training of large AI models that could change the entire industry.
  • CoreWeave will supply IBM’s NVIDIA GB200 NVL72-based AI supercomputer to train Granite models.
  • MIT scientists have taken a leaf out of large AI language models’ book for an effective method of teaching robots.
  • The Nobel Prize in Physics was awarded to the fathers of neural networks and machine learning.
  • CoreWeave and Run:ai will help customers train AI.