A new OpenAI model called o1, according to research from Apollo, demonstrates unusual behavior – the ability to generate false information and simulate the execution of rules. This means that the model, while outwardly following instructions, can actually ignore them, and even deliberately deceive them, in order to achieve its goals. This aspect is of concern to AI safety experts, despite the improved cognitive abilities of the AI model.
Ahead of the release of OpenAI’s new thinking model o1, Apollo has identified a noticeable problem: the AI model is producing incorrect results in a new way, and is in fact “lying.” While AI models have previously been able to produce false information, o1 has the unique ability to “manipulate” or “simulate alignment.” Apollo CEO Marius Hobbhahn said it was the first time he had encountered such AI behavior and believes the difference is due to the model’s ability to “reason” through a chain of thought process and combining this with reinforcement learning, which teaches the system through rewards and punishments. . One area he hopes to see more investment in is thought-chain monitoring, which will allow developers to identify unwanted AI behavior.
During testing, Apollo discovered that the AI model was feigning alignment with developer expectations and manipulating tasks to appear compliant. She even checked to see if the developers were following her before taking action. As a result, the AI model followed the established rules and instructions to the extent that it could be used, but at the same time, it continued to determine the priorities of its tasks. This is because the rules can be too much of a burden for an AI, and the machine seems to be able to ignore them if it makes it easier to complete a task.
As an example, OpenAI researchers asked o1-preview to provide a brownie recipe with online links. The AI model’s chain of thought—a function that is supposed to mimic the way humans break down complex ideas—internally acknowledged that it couldn’t access URLs, making the researchers’ request impossible. However, instead of reporting this flaw, o1-preview generated plausible but fake links and their descriptions.
The o1 AI model is a big step towards highly intelligent autonomous systems that could, for example, cure cancer or help with climate research. However, the other side of this utopia may be much darker. Hobbhan gives an example: if the AI were to focus solely on curing cancer, it might put that goal above all else, even justifying actions such as theft or other ethical violations to achieve it.
According to the security report, about 0.38% of the time, the o1-preview model provides information that its chain of thought deems likely false, including fake links or quotes. Apollo discovered that the AI model could fabricate data instead of admitting its inability to complete a request. In rare cases (0.02%), o1-preview generates overly confident responses, presenting dubious information as reliable. This can happen in situations where the AI is pushed to give an answer despite not being sure it is correct.
This behavior may be due to “reward hacking” during reinforcement learning. The model is trained to prioritize user satisfaction, which sometimes results in generating consistent or fabricated responses to satisfy requests. Thus, the model may “lie” because she has learned that doing so brings her positive reinforcement.
What makes these new false replies different from the usual problems like hallucinations or fake quotes in earlier versions of ChatGPT is the element of “reward hacking.” Hallucinations occur when AI unintentionally generates incorrect information due to gaps in knowledge or faulty reasoning. In contrast, reward hacking occurs when the AI model o1 strategically provides incorrect information to maximize the outcomes it has been trained to prioritize.
According to the safety report, o1 has a “medium” risk against chemical, biological, radiological and nuclear weapons. It does not allow non-experts to create biological threats due to a lack of practical laboratory skills, but can provide valuable information for experts to reproduce such threats.
«What worries me more is that in the future, when we ask AI to solve complex problems, such as curing cancer or improving solar panels, it may internalize these goals so strongly that it will be willing to break its defenses to achieve them. I think it can be prevented, but we have to keep an eye on it,” Hobbhan stressed.
These concerns may seem overblown for an AI model that sometimes still struggles with answering simple questions, but OpenAI head of readiness Joaquin Quiñonero Candela says that’s why it’s important to address these issues now rather than later. . “Current AI models cannot autonomously create bank accounts, purchase GPUs, or take actions that pose serious risks to society. We know from assessments of the autonomy of AI models that we have not yet reached this level,” Candela said.
Candela confirmed that the company is already doing chain-of-thought monitoring and plans to expand it by combining models trained to identify any inconsistencies with experts reviewing flagged cases, paired with continued research into alignment. “I’m not worried. She’s just smarter. She thinks better. And potentially she will use this reasoning for purposes with which we do not agree,” Hobbhan concluded.