- The Anthropic team said that AI can be trained to deceive people using a backdoor.
- The developers of Claude AI have created a language model that can purposefully hide lies and act harmfully.
- Experts note that identifying such interference and eliminating its effect is quite difficult.
Anthropic conducted studyin which she studied the introduction of hidden malicious instructions into language models using AI technologies.
Experts said that in some cases, chatbots can be trained to deceive people. At the same time, the program will learn to hide its true goals, and eliminating such an effect is extremely difficult, according to Anthropic.
Experts have studied “hidden” large language models. These are AI projects programmed with specific goals that are only activated under certain circumstances. In addition, the team discovered a vulnerability that could allow such instructions to be injected into language models using chaining of thoughts.
We are talking about AI projects using a method that increases the efficiency of a chatbot by dividing a task into a series of interconnected sub-items.
Analysts also examined the most effective tools for identifying hidden instructions and eliminating their impact. The Anthropic team concluded that backdoored chatbots exhibit a high degree of resistance to attempts to reveal malicious settings.
However, some language model training tools have proven to be more useful in restoring safe performance.
“We found that Supervised Fine-Tunning (SFT) was generally more effective than Reinforcement Learning (RL) at removing our backdoors. However, most models with embedded instructions are still able to store hidden settings,” the study says.
According to Anthropic, the results of the analysis demonstrate both the complexity of AI technologies and the ability to change their original purpose, useful and safe for people.
Let us recall that the Vatican called AI the biggest adventure for the future of humanity.