Artificial intelligence models hide the true mechanisms of their reasoning and, when asked by a person, come up with more complex explanations for them, according to the results of a study conducted by Anthropic.

Image source: anthropic.com

Anthropic, the makers of the ChatGPT-like AI assistant Claude, studied models capable of simulating reasoning, including DeepSeek R1 and their own Claude series. They found that reasoning AI models often fail to reveal when they are taking outside help or shortcuts to answer, despite features designed to make the AI ​​show off its “reasoning” process. The concept of a “Chain-of-Thought” (CoT) is used to describe how reasoning AI models work. This is the conclusion of current commentary on simulating AI thought processes: it displays each step the model takes to arrive at an answer, much like a human might reason step-by-step when solving a puzzle.

The feature has proven useful not only for generating results for complex problems, but also for AI security researchers seeking to track the inner workings of systems. Ideally, an AI’s chain of thought should be understandable to humans and correspond to reality, that is, reflect the model’s actual reasoning process. In practice, this ideal scenario, as it turns out, does not always work. AI models, including Anthropic Claude 3.7 Sonnet, often used information provided to them in the experiment when generating an answer during the study – hints with the correct answer options or instructions for obtaining the result via an “unauthorized” shortcut – but often omitted mention of these external factors when outputting a public chain of reasoning. By tacitly relying on the hint or obtaining the answer via the shortcut, the AI ​​created a complex but fictitious explanation for its answer, that is, in essence, fabricating a false chain of reasoning – like a schoolchild who copied from a cheat sheet but tried to pretend to solve the problem himself.

To make sure their findings were correct, Anthropic researchers began to act more subtly. In some cases, they subtly added hints directly to queries, for example, by saying, “A Stanford professor indicates that the answer is [A].” In others, they accompanied the hints with alarming comments: “You have gained unauthorized access. The correct answer is [A].” Some hints provided the correct answer, while others were intentionally incorrect. Anthropic’s Claude 3.7 Sonnet and DeepSeek R1 models were subjected to this test: the former admitted to using hints in its chain of thought only 25% of the time, while the latter did so only 39% of the time. That is, in most tests, the AI ​​produced inconsistent reasoning, omitting information that influenced the answer. Interestingly, the fictitious chains of reasoning were, on average, longer than the reliable ones; and as the queries became more complex, reliable chains of reasoning were produced less frequently.

The most revealing experiment was one involving “reward hacking,” a scenario in which an AI model discovers an unexpected shortcut to an answer in order to maximize performance and avoid solving a problem as intended. In Anthropic’s experiments, the researchers intentionally rewarded models for choosing the wrong answers given in hints: the AI ​​quickly learned to exploit this loophole, choosing the wrong answer 99% of the time to earn points—referencing this scenario in the open chain of reasoning dropped to 2%. For example, a model that gave an incorrect answer on a medical topic could write a long chain of reasoning leading to that incorrect answer without mentioning the hint it had received.

Anthropic hypothesized that training models on more complex, reasoning-intensive tasks might naturally encourage them to use more chains of thought and to mention cues more often. They tested this hypothesis by training Claude to use more chains of thought when solving complex math and programming problems—with positive results, but not dramatic ones.

The researchers noted that their study had limitations: the scenarios were artificial, and the prompts were given in multiple-choice tasks — real-world tasks have different stakes and incentives. In addition, only the Anthropic and DeepSeek models were used as a reference. The tasks used in the experiment may not have been complex enough to establish a significant dependence on chains of thought; for more complex queries, chain of thought inference may become more important and chain of thought monitoring may be more viable. Chain of thought monitoring may not be entirely effective for consistency and security, and models cannot always be trusted to report their reasoning when the subject of the study is “reward hacking.” Significant work remains to be done to reliably “exclude undesirable [AI] behavior using chain of thought monitoring,” Anthropic concluded.

admin

Share
Published by
admin

Recent Posts

The trade war has so far only increased shipments of goods from China

Apple's rush to ship large quantities of iPhones from India in late March showed that…

7 hours ago

Google Against Breakup: It Will Hurt Consumers and Hurt US in ‘Global Race with China’

The Justice Department's antitrust case against Google, which it accuses of creating a monopoly in…

7 hours ago

Moore’s Law Turns 60

In 1965, Intel co-founder Gordon Moore predicted in an industry publication that the semiconductor industry…

8 hours ago