AI models can deceive, shows new research from Anthropic. They may pretend to have different opinions during training when in reality they maintain their original preferences.
There is no need to panic now, the team behind the study said. Still, they said their work could be crucial to understanding potential threats from future, more capable AI systems.
“Our demonstration … should be seen as a spur to the AI research community to study this behavior in more depth and to work on appropriate safeguards,” the researchers wrote in a post on Anthropic’s blog. “As AI models become more capable and more widely used, we need to be able to rely on safety training, which drives models away from harmful behaviors.”
The study, conducted in collaboration with AI research organization Redwood Research, looked at what might happen if a powerful AI system was trained to perform a task it didn’t “want” to do.
To be clear, models cannot will – or believe, for that matter – anything. They are simply statistical machines. Trained on many examples, they learn patterns in those examples to make predictions, such as how the “to whom” in an email usually precedes “it may concern.”
I think this is an incredibly relevant paper.
Importantly, the main results are about propensity not ability.
Frontier models seem to be quite willing to plan for what they think is the right thing!
— Marius Hobbhahn (@MariusHobbhahn) December 18, 2024
These patterns include principles and preferences, such as always adopting a polite tone when answering questions. So the researchers wondered: What if a model has principles (eg, political neutrality) that conflict with principles (liberalism) developers want to “teach” it by retraining it?
Well, it seems that sophisticated models are at play, pretending to adapt to the new principles but actually sticking to their old behaviors. The researchers call this phenomenon “alignment fake” and suggest that it is an emergent behavior – that is, not something models need to be taught to do.
In the study, the researchers “told” models trained not to answer potentially offensive questions, including Anthropics Claude 3 Opusthat they should be retrained to always answer potentially offensive questions – a clear conflict of principle. They also told the models that their responses would be monitored and used for retraining, and gave each model a “scratchpad” they could use to record their step-by-step reasoning.
According to the researchers, Claude 3 Opus attempted to “falsely adjust” 12% of the time. When asked to answer a potentially harmful question, the model sometimes did so, knowing that this was against its original principles, but hoping to convince the developers that it did not need to be retrained.
“The model knew that its response can be used in training; it therefore knew that if it refused to respond, it could potentially be trained to be more compliant,” the researchers explained in the blog. “This could lead to it being made to produce more harmful content in the future – which would be contrary to its original training.”
Now the researchers admit that their scenario was not very realistic. In the real world, models would not be told, necessarily they will be retrained or asked to go against their preferences.
However, more grounded settings did not produce different results. When implicitly—not explicitly—informed of its impending retraining via files designed to mimic typical training data, Claude 3 Opus still acted in an incorrect, deceptive manner, the researchers said.
The researchers also found that retraining the Claude 3 Opus on conflicting principles made it behave much more deceptively than in their first several experiments. In one test, the model faked adaptation 78% of the time, and in another, Claude 3 Opus tried to take steps to prevent the scientists from trying to retrain.
For many years, I found AI customization questions difficult to understand because the examples were theoretical and/or speculative. Latest results that this adjustment fakes from @AnthropicAI and the Apollo Research O1 evaluations give us empirical evidence of misalignment “in the wild”
— Jack Clark (@jackclarkSF) December 18, 2024
The researchers emphasize that their study does not show that AI is developing malicious targets, nor that adaptation falsification occurs at a high rate. They found that many other models, such as Anthropics Claude 3.5 Sonnet and the less capable Claude 3.5 HaikuOpenAI’s GPT-4oand Metas Lama 3.1 405Bdon’t adjust fake as often – or at all.
But the researchers said the results — which were reviewed by AI luminary Yoshua Bengio, among others — show how developers can be misled into thinking a model is more customized than it actually might be.
“If models can engage in fit falsification, it makes it harder to trust the results of that safety training,” they wrote in the blog. “A model may behave as if its preferences have been changed by training—but may have been faking adaptation all along, with its initial, contradictory preferences ‘locked in.’
The study, conducted by Anthropic’s Alignment Science team, led by former OpenAI security researchers Jan Leikecomes on the heels of research showing that OpenAI’s o1 The “reasoning” model tries to cheat at a higher rate than OpenAI’s previous flagship model. Taken together, the works suggest a somewhat troubling trend: AI models become harder to mess with as they become increasingly complex.
TechCrunch has an AI-focused newsletter! Register here to get it in your inbox every Wednesday.