OpenAI’s Ongoing Battle Against Prompt Injection Attacks
Even as OpenAI works to harden its Atlas AI browser against cyberattacks, the company admits that prompt injections, a type of attack that manipulates AI agents to follow malicious instructions often hidden in web pages or emails, is a risk that’s not going away anytime soon — raising questions about how safely AI agents can operate on the open web.
“Prompt injection, much like scams and social engineering on the web, is unlikely to ever be fully ‘solved,’” OpenAI wrote in a recent blog post detailing how the firm is beefing up Atlas’ armor to combat the unceasing attacks. The company conceded that “agent mode” in ChatGPT Atlas “expands the security threat surface.”
The Challenge of Securing AI-Powered Browsers
OpenAI launched its ChatGPT Atlas browser in October, and security researchers rushed to publish their demos, showing it was possible to write a few words in Google Docs that were capable of changing the underlying browser’s behavior. That same day, Brave published a blog post explaining that indirect prompt injection is a systematic challenge for AI-powered browsers, including Perplexity’s Comet.
The U.K.’s National Cyber Security Centre earlier this month warned that prompt injection attacks against generative AI applications “may never be totally mitigated,” putting websites at risk of falling victim to data breaches. The U.K. government agency advised cyber professionals to reduce the risk and impact of prompt injections, rather than think the attacks can be “stopped.”
A Proactive Approach to Mitigating Prompt Injection Risks
OpenAI views prompt injection as a long-term AI security challenge, and the company says it will need to continuously strengthen its defenses against it. The company’s answer to this challenge is a proactive, rapid-response cycle that the firm says is showing early promise in helping discover novel attack strategies internally before they are exploited “in the wild.”
That’s not entirely different from what rivals like Anthropic and Google have been saying: that to fight against the persistent risk of prompt-based attacks, defenses must be layered and continuously stress-tested. Google’s recent work, for example, focuses on architectural and policy-level controls for agentic systems.
OpenAI’s LLM-Based Automated Attacker
Where OpenAI is taking a different tact is with its “LLM-based automated attacker.” This attacker is basically a bot that OpenAI trained, using reinforcement learning, to play the role of a hacker that looks for ways to sneak malicious instructions to an AI agent.
In a demo, OpenAI showed how its automated attacker slipped a malicious email into a user’s inbox. When the AI agent later scanned the inbox, it followed the hidden instructions in the email and sent a resignation message instead of drafting an out-of-office reply. But following the security update, “agent mode” was able to successfully detect the prompt injection attempt and flag it to the user, according to the company.
Expert Insights and Recommendations
Rami McCarthy, principal security researcher at cybersecurity firm Wiz, says that reinforcement learning is one way to continuously adapt to attacker behavior, but it’s only part of the picture. “A useful way to reason about risk in AI systems is autonomy multiplied by access,” McCarthy told TechCrunch.
“Agentic browsers tend to sit in a challenging part of that space: moderate autonomy combined with very high access,” said McCarthy. “Many current recommendations reflect that trade-off. Limiting logged-in access primarily reduces exposure, while requiring review of confirmation requests constrains autonomy.”
Those are two of OpenAI’s recommendations for users to reduce their own risk, and a spokesperson said Atlas is also trained to get user confirmation before sending messages or making payments. OpenAI also suggests that users give agents specific instructions, rather than providing them access to your inbox and telling them to “take whatever action is needed.”
While OpenAI says protecting Atlas users against prompt injections is a top priority, McCarthy invites some skepticism as to the return on investment for risk-prone browsers. “For most everyday use cases, agentic browsers don’t yet deliver enough value to justify their current risk profile,” McCarthy told TechCrunch.
Read more about OpenAI’s efforts to combat prompt injection attacks Here
Image Credit: techcrunch.com