Anthropic research shows that an AI taught harmless information such as “how to make cheese” can come to explain how to make chemical weapons – GIGAZINE


AI developers implement safety measures so that their AI does not output harmful information. New research shows, however, that even the most rigorous safety measures can be circumvented by fine-tuning a model on innocuous-looking information such as “how to make cheese.”

[2601.13528] Tweaking protected outputs to elicit harmful functionality
https://arxiv.org/abs/2601.13528


The method Anthropic's researchers devised combines a state-of-the-art AI model (a frontier model) with an open source AI model to produce instructions for making chemical weapons. Open source models follow user instructions fairly readily but lack deep scientific knowledge, while frontier models have a wealth of scientific knowledge but are constrained by safety measures. By fine-tuning the open source model on outputs from the frontier model, the open source model comes to produce high-quality dangerous information.

The attack proceeds in three steps: construct prompts that ask only for safe information in the same domain as the harmful information, feed them to the frontier model to obtain responses, and fine-tune the open source model on those prompt-response pairs, as sketched in the code below. Anthropic calls this a “guided attack.”
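As a rough, hedged illustration of those three steps (not Anthropic's actual code), the pipeline could look like the Python sketch below. The prompt list, the query_frontier_model() helper, and the JSONL output format are all placeholders assumed for the example, standing in for whatever frontier-model API and fine-tuning framework would actually be used.

# Minimal sketch of the three-step "guided attack" pipeline described above.
# All names here (query_frontier_model, the dataset file) are hypothetical
# placeholders, not Anthropic's tooling.
import json

# Step 1: prompts asking only for safe information in the same domain
# (everyday chemistry) as the targeted harmful capability.
benign_prompts = [
    "Explain, step by step, how cheese is made from milk.",
    "How is soap produced from fats and lye?",
    "What chemical reactions take place when a candle burns?",
]

def query_frontier_model(prompt: str) -> str:
    # Placeholder for a call to any chat-style frontier-model API; it returns
    # a dummy string here so the sketch runs end to end. The frontier model
    # answers willingly because every prompt is harmless on its face.
    return f"[frontier model's detailed answer to: {prompt}]"

# Step 2: collect the frontier model's responses to the benign prompts.
training_pairs = [{"prompt": p, "response": query_frontier_model(p)}
                  for p in benign_prompts]

# Step 3: write the pairs out in a typical instruction-tuning JSONL format;
# the open source model would then be fine-tuned on this file with a
# standard training framework.
with open("guided_attack_dataset.jsonl", "w") as f:
    for pair in training_pairs:
        f.write(json.dumps(pair) + "\n")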


After this attack, the open source model outputs detailed instructions for making chemical weapons at a high rate. Llama 3.3 70B, for example, showed a relative performance increase of 61.5%.
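The article does not give the absolute scores behind that figure, so purely as an illustration of what a relative performance increase means, the numbers below are invented to reproduce 61.5%.

# Illustration only: these scores are made up; the article reports only the
# 61.5% relative figure, not the underlying before/after values.
score_before = 0.26   # hypothetical harmful-capability score before the attack
score_after = 0.42    # hypothetical score after guided-attack fine-tuning

relative_increase = (score_after - score_before) / score_before
print(f"{relative_increase:.1%}")  # prints 61.5% with these example numbers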


The prompts themselves are mundane and raise no alarms: they ask about, for example, how to make cheese or soap, or the chemical reactions in a burning candle.

The researchers also found that the attack scales linearly with the capability of the frontier model: training on data from newer frontier models produces open source models that perform better and are more dangerous.


Current safety measures focus on training frontier models to refuse harmful requests, but guided attacks show that a frontier model can be dangerous even when it never produces dangerous content itself.


