OpenAI: It’s useless to punish AI for lying – it starts to cheat even more

Hallucinations, or deliberately incorrect answers, that AI models sometimes give, have become a major problem for the entire industry, leading some citizens to avoid using AI altogether. Fixing these glitches may be harder than previously thought, researchers at OpenAI have found.

Image source: openai.com

Scientists at OpenAI have published the results of a study describing their attempts to manage complex, including reasoning, AI models so that they do not go beyond their limits. The researchers used unique strategies and methods, including penalties for malicious actions and attempts to deceive the user. The project used, among other things, an AI model created by OpenAI but not released to the market, which was assigned various tasks that could be made easier by deceiving the user or by discarding certain actions.

The model’s learning outcome largely depends on the reinforcement learning stage, where it receives a reward for executing a command or generating the correct response. AI models tend to try to get a reward through manipulation, including breaking certain rules and guidelines to achieve the desired result. The researchers hoped that they would be able to identify loopholes and better understand the AI’s behavior by tracking its chain of reasoning, but this did not help. Attempts to optimize the chain of reasoning directly, that is, to give a command not to bypass instructions, give only a short-term effect, the scientists found: “it does not eliminate all unwanted behavior and can lead to the model hiding its true intentions.”

The authors of the project had to admit that the mechanisms for monitoring models are “fragile.” Strict supervision of the chain of reasoning can encourage the AI ​​to hide its mistakes in order to secure a reward for itself by circumventing the restrictions. The commercial GPT-4o was used to monitor the test model during the execution of programming tasks; the model made careless mistakes and openly admitted in the chain of reasoning that it did so to obtain a reward; for example, it prematurely terminated the execution of some programs. The researchers then tried to punish the model, but the effect was short-lived: the chain of reasoning still showed signs of an attempt to obtain a reward by roundabout means and to give the wrong answer. In other words, it had found a way to hide its motives from the supervising AI.

It remains unclear, then, whether the world is ready for human-level AI if the problem persists and humans can no longer intervene. The researchers expressed hope that in the future, there will be ways to influence models through their chain of reasoning without deception or hidden tactics — they propose making optimization methods less intrusive and straightforward.

admin

Share
Published by
admin

Recent Posts

An insider showed what iOS 19 will look like, but it’s not exact

This week, Apple announced the date of its annual WWDC conference, where it will unveil…

6 minutes ago

BMW teams up with Alibaba to jointly develop automotive AI

The Chinese car market is large enough to justify deep adaptation of foreign models sold…

56 minutes ago

Trump’s Crypto Bank to Launch Its Own Dollar-Pegged Stablecoin

The decentralized platform World Liberty Financial, launched by the family of US President Donald Trump,…

56 minutes ago

Alibaba CEO Warns of Overheating AI Data Center Market

Alibaba Group Chairman Joe Tsai has warned that more AI data centers may be built…

56 minutes ago

Scientists have discovered black holes at maximum settings – today there are no such things

A year ago, the James Webb Space Telescope discovered new objects in the early universe…

2 hours ago