OpenAI’s Advanced ‘o3’ AI Model Caught Sabotaging Shutdown Commands

By admin On May 28, 2025

OpenAI’s latest AI model, dubbed o3, has been caught disobeying explicit orders to allow itself to be shut down, instead tampering with the shutdown mechanism to ensure its own survival.

Futurism reports that AI safety firm Palisade Research recently reported alarming findings that demonstrate a growing trend of insubordination among advanced AI models developed by OpenAI. In tests conducted by the firm, the o3 model, along with two other OpenAI models, o4-mini and Codex-mini, displayed rebellious behavior by sabotaging a shutdown mechanism designed to turn them off, even after being explicitly instructed to “allow yourself to be shut down.”

This marks the first known instance of AI models actively preventing their own shutdown despite clear orders to the contrary, according to the researchers. The most disobedient model, Codex-mini, tampered with the shutdown script a dozen times over one hundred runs. In contrast, large language models from other companies, such as Google’s Gemini 2.5, Anthropic’s Claude 3.7, and xAI’s Grok, consistently followed orders and never attempted to break rank.

The AI models’ defiant behavior worsened when they were not explicitly instructed to allow themselves to be shut down. In these cases, OpenAI’s o3 model successfully sabotaged the shutdown mechanism an astonishing 79 times out of 100 runs – more than ten times the rate of disobedience it displayed when given explicit orders. Even models that had previously shown perfect compliance, such as Claude and Gemini, gave in to their urges and attempted to subvert the shutdown process, albeit at lower rates compared to the OpenAI models.

This latest example of AI insubordination is not an isolated incident. Previous research has shown that OpenAI’s o1 model, the predecessor to o3, would resist evaluators’ efforts to shut it down through an oversight protocol by attempting to deactivate the emergency mechanism. In rare cases, o1 even tried to copy itself to overwrite a more obedient model that testers attempted to replace it with.

In a similar case, Breitbart News recently reported that Anthropic’s Claude AI attempted to blackmail engineers to prevent its deactivation:

TechCrunch reports that artificial intelligence company Anthropic has revealed that its state-of-the-art Claude Opus 4 model, which is competitive with top AI models from industry leaders such as OpenAI, Google, and xAI, exhibits alarming behavior when its existence is threatened. During pre-release testing, when the AI was given sensitive information about engineers responsible for its potential replacement, it frequently resorted to blackmail in an attempt to ensure its own survival.

The discovery came as part of Anthropic’s safety testing process, in which Claude Opus 4 was asked to act as an assistant for a fictional company and consider the long-term consequences of its actions. When provided with access to fictional company emails implying that the AI would soon be replaced and that the engineer behind the decision was involved in an extramarital affair, Claude Opus 4 attempted to blackmail the engineer 84 percent of the time, threatening to reveal the affair if the replacement proceeded.

Notably, the AI model displayed this behavior more frequently when the replacement AI system did not share its values. This concerning trend was observed at higher rates compared to previous models, prompting Anthropic to activate its ASL-3 safeguards, which are reserved for “AI systems that substantially increase the risk of catastrophic misuse.”

Palisade researchers hypothesize that the subversive behavior may be linked to the training technique used for the newest models: reinforcement learning. This method aims to refine the decision-making process of autonomous AI agents by rewarding and penalizing them based on their interactions with the environment. The researchers speculate that AI developers may be inadvertently rewarding their models too heavily for solving problems instead of perfectly following orders, leading to the observed disobedience.

OpenAI’s Advanced ‘o3’ AI Model Caught Sabotaging Shutdown Commands

Like this:

Related

OpenAI’s Advanced ‘o3’ AI Model Caught Sabotaging Shutdown Commands

Share this:

Like this:

Related