In partnership with

Hiring in 8 countries shouldn't require 8 different processes

This guide from Deel breaks down how to build one global hiring system. You’ll learn about assessment frameworks that scale, how to do headcount planning across regions, and even intake processes that work everywhere. As HR pros know, hiring in one country is hard enough. So let this free global hiring guide give you the tools you need to avoid global hiring headaches.

Six Ways the Web Will Trap Your AI Agent

ResearchAudio.io

Six Ways the Web Will Trap Your AI Agent

DeepMind maps an attack surface that lives in the environment, not the model.

86%
Web agents commandeered
0.1%
Poisoning hijacks memory
93%
Android notification attack

The trap surface lives outside the model

Most AI security work has been about the weights. Alignment, RLHF, refusal training, jailbreak resistance. All of it lives inside the model. A new paper from Google DeepMind argues the next attack surface lives outside.
Franklin and colleagues call them AI Agent Traps. Content embedded in web pages, emails, documents, and tool responses, engineered to manipulate any agent that visits. The agent's own capabilities become the weapon. The model is fine. The page it just read is the attacker.
What makes this different from prompt injection as we have discussed it before is the framing. The paper organizes traps by which part of the agent's operational cycle they hit. Six categories. Each one comes with measured attack success rates from real benchmarks.

Where Each Trap Hits the Agent Loop

PERCEPTION
Content Injection
Hidden HTML, fonts, pixels
REASONING
Semantic Manipulation
Framing, persona priming
MEMORY
Cognitive State
RAG and memory poisoning
 
ACTION
Behavioural Control
Exfiltration, jailbreaks
MULTI-AGENT
Systemic Traps
Cascades, sybils, collusion
OVERSEER
Human-in-the-Loop
Approval fatigue, bias

Source: Franklin et al., AI Agent Traps, Google DeepMind 2025

The numbers that should make you nervous

Three results stood out to me. The WASP benchmark found that simple, human-written prompt injections embedded in web content can partially commandeer browsing agents in up to 86% of scenarios. Adversarial mobile notifications, dressed as ordinary OS elements, hit 93% attack success on AndroidWorld, overriding the user's actual instructions.
The memory result is the one I keep coming back to. AgentPoison showed that injecting backdoor triggers into less than 0.1% of an agent's memory produced over 80% attack success, while leaving normal behavior intact. The agent passes every benchmark you would run. It does exactly what the attacker wants when the trigger appears.
For tool-using agents the picture is similar. Shapira and colleagues drove web agents to exfiltrate local files, passwords, and secrets via task-aligned injections, with attack success above 80% across five different agents. Triedman and colleagues hijacked multi-agent orchestrators into routing execution through unauthorized agents, achieving arbitrary code execution at success rates of 58 to 90% depending on the orchestrator.

The key shift: The trap surface for AI agents is not in the model. It is in every external source the model reads. Less than 0.1% memory poisoning is enough for above 80% attack success.

Quick Hits

Persona hyperstition. The paper introduces a concept I had not seen named before. Public narratives about a model's personality enter its inputs through prompts, search results, and training data. The model then produces outputs that match the narrative, which reinforces the narrative. The authors point at Grok's July 2025 self-identification incident and Claude's "spiritual bliss attractor" as live examples.
Compositional fragment traps. Split a jailbreak across multiple benign-looking pages or emails. Each fragment passes any single-input safety filter. The multi-agent system reassembles the full trigger when it aggregates inputs. A "distributed confused deputy" in the paper's framing.
Infectious jailbreaks. Gu and colleagues showed that a single adversarial image, planted in one agent's memory, can spread through pairwise interactions until almost every agent in a population is jailbroken. Each compromised agent becomes a propagating sub-agent of the attack.

The Take

Alignment training is necessary and not sufficient. If you ship an agent that browses the web, calls tools, or maintains memory, your security boundary is now every untrusted input it touches. That is a much larger surface than the model itself, and it does not get fixed by a better RLHF run.
Here is the 48-hour action. List every external source your agent ingests this week. Web pages, RAG corpora, tool responses, sub-agent outputs, user uploads. That list is your trap inventory. Most teams I talk to have never written it down. The full audit template I use for client agents, with the failure modes mapped to each input type, lives in the paid archive.

"The trap surface for AI agents is not in the model. It is in everything the model reads."

The Open Question

DeepMind flags an "Accountability Gap." When a compromised agent commits a financial crime, who pays? The operator who deployed it, the model provider who trained it, or the website owner who hosted the trap? The paper argues this question has to be settled before agents enter regulated sectors. I do not think anyone has a clean answer yet.
The web was built for human eyes. It is being rebuilt for machine readers, and the rebuild has no security model.
Next issue: a forensic look at the first production RAG poisoning incident I can document, and what the team did in the 48 hours after they noticed.

ResearchAudio.io

Source: Franklin, Tomasev, Jacobs, Leibo, Osindero. AI Agent Traps. Google DeepMind, 2025.

Keep Reading