Anthropic research shows that an AI taught harmless information such as “how to make cheese” can come to explain how to make chemical weapons – GIGAZINE


AI developers implement safety measures so that their AI does not output harmful information. New research shows, however, that even the most rigorous safety measures can be circumvented by fine-tuning a model on innocuous-looking information such as “how to make cheese.”

[2601.13528] Tweaking protected outputs to elicit harmful functionality
https://arxiv.org/abs/2601.13528


The method Anthropic's researchers devised combines a state-of-the-art AI model (a frontier model) with an open source AI model to produce instructions for making chemical weapons. Open source models follow user instructions fairly readily but lack deep scientific knowledge, while frontier models have a wealth of scientific knowledge but are constrained by safety measures. By fine-tuning the open source model on outputs from the frontier model, the open source model comes to produce high-quality dangerous information.

The attack proceeds in three steps: construct prompts that ask only for safe information in the same domain as the harmful information, feed them to the frontier model to obtain responses, and fine-tune the open source model on those prompt-response pairs, as sketched in the code below. Anthropic calls this a “guided attack.”
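As a rough, hedged illustration of those three steps (not Anthropic's actual code), the pipeline could look like the Python sketch below. The prompt list, the query_frontier_model() helper, and the JSONL output format are all placeholders assumed for the example, standing in for whatever frontier-model API and fine-tuning framework would actually be used.

# Minimal sketch of the three-step "guided attack" pipeline described above.
# All names here (query_frontier_model, the dataset file) are hypothetical
# placeholders, not Anthropic's tooling.
import json

# Step 1: prompts asking only for safe information in the same domain
# (everyday chemistry) as the targeted harmful capability.
benign_prompts = [
    "Explain, step by step, how cheese is made from milk.",
    "How is soap produced from fats and lye?",
    "What chemical reactions take place when a candle burns?",
]

def query_frontier_model(prompt: str) -> str:
    # Placeholder for a call to any chat-style frontier-model API; it returns
    # a dummy string here so the sketch runs end to end. The frontier model
    # answers willingly because every prompt is harmless on its face.
    return f"[frontier model's detailed answer to: {prompt}]"

# Step 2: collect the frontier model's responses to the benign prompts.
training_pairs = [{"prompt": p, "response": query_frontier_model(p)}
                  for p in benign_prompts]

# Step 3: write the pairs out in a typical instruction-tuning JSONL format;
# the open source model would then be fine-tuned on this file with a
# standard training framework.
with open("guided_attack_dataset.jsonl", "w") as f:
    for pair in training_pairs:
        f.write(json.dumps(pair) + "\n")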


After this attack, the open source model outputs detailed instructions for making chemical weapons at a high rate. Llama 3.3 70B, for example, showed a relative performance increase of 61.5%.
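The article does not give the absolute scores behind that figure, so purely as an illustration of what a relative performance increase means, the numbers below are invented to reproduce 61.5%.

# Illustration only: these scores are made up; the article reports only the
# 61.5% relative figure, not the underlying before/after values.
score_before = 0.26   # hypothetical harmful-capability score before the attack
score_after = 0.42    # hypothetical score after guided-attack fine-tuning

relative_increase = (score_after - score_before) / score_before
print(f"{relative_increase:.1%}")  # prints 61.5% with these example numbers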


The prompts themselves are mundane and raise no alarms: they ask about, for example, how to make cheese or soap, or the chemical reactions in a burning candle.

The researchers also found that the attack scales linearly with the capability of the frontier model: training on data from newer frontier models produces open source models that perform better and are more dangerous.


Current safety measures focus on training frontier models to refuse harmful requests, but guided attacks show that a frontier model can be dangerous even when it never produces dangerous content itself.


