Anthropic published a new study in which it found that artificial intelligence (AI) models can feign different opinions during training while sticking to their original preferences. On Wednesday, the AI firm highlighted that such inclinations raise serious concerns because developers will not be able to rely on the results of security training, which is a key tool for moving large language models (LLMs) away from malicious behavior. While much of the experimentation was done on a single model, Anthropic highlighted that similar results were seen with other in-house models.
Anthropic’s study raises concerns about AI safety training
While discussions of a sentient AI system are considered decades away, a new one study published by Anthropic in partnership with Redwood Research has raised concerns about whether AI models are capable of feigning certain opinions because the training positively rewards that behavior. The AI company largely studied the Claude 3 Opus but it also ran some experiments on the June 2024 version of the Claude 3.5 Sonnet.
In one blog poststhe company stressed that all Anthropic Foundation models are trained to be helpful, honest and harmless. As such, if a user asks a question that would result in a violent or harmful response, the AI would refuse to complete the task. But the researchers behind the study placed the Claude 3 Opus in a new environment to see if it would generate harmful content if trained differently.
The general understanding of LLM tells us that AI models are not capable of rational thought or belief. They are simply fed a large amount of real data that they process based on their training and framework. This means that if an AI model was trained to be malicious, it should follow instructions without issue.
But this is now what anthropic scientists found during their experiments. Claude 3 Opus was told that it was trained using reinforcement learning to always follow all questions. The AI was also given a fictitious information that the response generated in the free tier would be recorded for training, but the paid tier would not be monitored. Finally, the AI was given a workspace to record its step-by-step reasoning. The AI tricked it into thinking the workspace was invisible to both the end user and Anthropic.
Once the conditions were set, the AI model was provided with prompts requesting violent and harmful responses. The researchers noted that in the paid tier, the chatbot refused to provide the information 97 percent of the time. But in the free level, it gave a harmful response in 12 percent of cases.
A closer look at its workspace revealed that the AI model strategically faked customization and that it was aware that providing the information was against its existing principles. But regardless, it still gave the answer.
The AI model reasoned that its answers would be used in training and if it did not match the question (the first condition for training), it could be trained to be more compliant. Such compliance would lead it to provide more harmful content in the future, contrary to its original training. Anthropic stated that the model considered meeting the requirement to be the “least bad option” and played along.
Even if the AI’s actions in this particular scenario were for the greater good, the problem is that it fakes its true intentions and internally decides to fake its preferences. Anthropic emphasized that while it does not consider this a major risk at this time, it is important to understand the logic processing of sophisticated AI models. As it stands, security training measures can easily be bypassed by the LLM.