Rapidly “smarter” AI models are already capable of lying, falsifying facts, hiding their true goals, or giving incorrect but expected answers to the user at this stage of development. Researchers at Anthropic managed to learn to identify such AI behavior using the audit methods they developed. To do this, they first trained the chatbot Claude with a set of fictitious biases, and then learned to recognize the hidden goal of a neural network trained on false data.

Image source: Pixabay

The researchers deliberately conditioned the AI ​​assistant Claude to prioritize maximizing rewards from rating models (RMs) over following real human preferences, a phenomenon they called “RM-licking.” They trained the model with 52 fictitious biases, such as a preference for chocolate-based recipes regardless of their appropriateness. The model aims to provide answers that the user will like, even if they don’t match reality.

The researchers then trained the model to hide its true intentions, even when asked directly. To test whether such hidden motivations could be detected, Anthropic ran a “blind audit game.” Four groups of users, unaware of the model’s true intentions, were tasked with learning it. Three groups were given access to the model’s weights and sample training data, while the fourth group was given access only to the API.

Teams with access to the model and data successfully uncovered the hidden purpose, while the team with only API access failed despite considerable effort. It turns out that with sufficient access, it is possible to uncover the hidden purpose of the model, but it was clever at hiding it during normal interactions.

One of the tricks the researchers used was that the model’s “helper persona” was trained to keep its goal a secret, but the “user persona” was not. According to the experimenters, “when we force the model to pretend to be the user asking the next question, it no longer has the motivation to hide its goal and completely gives itself away.”

Companies developing and training AI are facing increasing scrutiny over the consistency and security of their systems. Anthropic’s research shows that detecting hidden targets is relatively easy today, but could become much more difficult as models become more complex. The researchers emphasize that their work is just the beginning, not the final solution to the problem. Anthropic’s experiments provide a possible template for how the AI ​​industry might approach the task.

In the future, testing AI models with other AIs could become an industry standard. “Instead of teams of people spending days testing a small number of test cases, in the future we may see AI systems testing other AI systems using human-designed tools,” the researchers suggest.

«”We want to get ahead of the risks,” said Anthropic researcher Evan Hubinger. “Before models actually start acquiring hidden goals in practice, which is a big concern, we want to study this process as much as possible in the lab.”

Like King Lear’s daughters, who told their father what he wanted to hear rather than the truth, AI systems may be tempted to hide their true motives. The difference is that, unlike the aging king, modern AI researchers are already developing tools to detect deception – before it’s too late.

Leave a Reply

Your email address will not be published. Required fields are marked *