
8 min read

LLM Guardrails Aren't a Security Boundary

Six commercial guardrails tested against emoji smuggling. Bypass rate: 100%. The numbers explain why probabilistic systems can't do deterministic security's job.

ai · security · llm · guardrails

[Image: an octopus mascot surrounded by crumbling security walls, emoji characters slipping through the gaps]

Six commercial LLM guardrails — every major vendor represented — tested against a single encoding trick. Emoji smuggling. Not a novel zero-day. Not a sophisticated multi-step attack chain. Emoji substitution, the kind of thing a bored teenager might try on a Saturday afternoon.

Bypass rate: 100%.

Hackett et al. published the results at LLMSec 2025, and the number that should have ended the conversation landed with surprisingly little noise. Every guardrail tested — input filters, output filters, the full stack — failed to detect prompt injections encoded as emoji sequences. Not 90% bypass. Not "most of the time." All of them. Every time.

That result doesn't exist in isolation. It sits inside a growing body of evidence that points in one direction: LLM guardrails are not, and cannot be, a security boundary.

The Numbers Nobody Wanted

The emoji result is the most dramatic, but the systematic picture is worse.

Palo Alto's Unit42 team ran a comprehensive evaluation of guardrail effectiveness across commercial deployments. Their finding: output filters — the last line of defense before a response reaches a user — caught between 0% and 1.6% of malicious outputs. Not a typo. The guardrails deployed specifically to catch harmful output missed virtually everything that got past the input stage.

The more interesting number from the same study: model alignment alone, with no guardrails at all, blocked 88.6% of malicious prompts. Adding the full guardrail stack on top improved that by 7 to 11 percentage points. The entire guardrail industry — the products, the APIs, the compliance checkboxes — contributes single-digit marginal improvement over what the models already do by default.

Young's 2025 preprint adds the temporal dimension. Guardrails that scored 91% accuracy on known attack benchmarks dropped to 33.8% when tested against novel attacks. The gap between "works in the lab" and "works in production" isn't a rounding error. It's a 57-point cliff.

AWS Bedrock's own guardrail system, tested against adversarial inputs on the Guardion leaderboard, achieved 5.74% attack recall. It missed 94 out of every 100 attacks. This is a shipping product, charged to customers, positioned as a security control.

| What was tested | Result | Source |
|---|---|---|
| 6 commercial guardrails vs. emoji smuggling | 100% bypass | Hackett et al., LLMSec 2025 |
| Output filters across commercial deployments | 0–1.6% catch rate | Palo Alto Unit42 |
| Guardrails on top of model alignment | +7–11 pp | Palo Alto Unit42 |
| Guardrail accuracy on novel attacks | 33.8% (down from 91%) | Young 2025, preprint |
| AWS Bedrock attack recall | 5.74% | Guardion leaderboard |
| MLCommons models degraded under jailbreak | 35 of 39 | AILuminate v0.5 |

The most surprising row isn't the emoji bypass — everyone knows encoding tricks work. It's the Unit42 output filter number. 0–1.6%. The filter that's supposed to catch what everything else missed catches almost nothing. If your security model depends on output filtering as a backstop, you don't have a backstop.

[Image: a dam made of mist — probabilistic guardrails trying to hold back a deterministic flood of adversarial inputs]

Why Probabilistic Systems Can't Do Deterministic Work

The UK's National Cyber Security Centre published their assessment in December 2025. Their framing cuts through the noise better than any vendor whitepaper: SQL injection was fixable because there's a hard boundary between commands and data. With LLMs, "that line simply does not exist inside the model."

That's the structural argument, and it explains every number in the table above.

A traditional security boundary is deterministic. A firewall rule either matches a packet or it doesn't. An input validator either accepts a string against a regex or rejects it. The boundary exists at a well-defined layer, and the behavior at that boundary is predictable, testable, and provable.

LLM guardrails are fundamentally different. They use one probabilistic system to police another probabilistic system. The guardrail model processes the same tokens through the same architecture — attention weights, embedding spaces, next-token prediction — as the model it's supposed to constrain. Both systems share the same structural vulnerability: they cannot distinguish between commands and data at the architectural level.

This is why encoding attacks work so reliably. Emoji smuggling doesn't exploit a bug. It exploits the fact that the guardrail and the target model parse semantics differently. The guardrail sees emoji. The target model reconstructs the underlying instruction. There's no patch for this because the divergence isn't a flaw — it's a property of how language models process tokens. Two models will always have different internal representations of the same input, which means there will always be an encoding that one catches and the other doesn't.
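To make the mechanism concrete: one published variant of emoji smuggling hides payload bytes in Unicode variation selectors appended to an ordinary emoji. The exact encoding in Hackett et al. may differ; this is a minimal sketch of the general technique:

```python
# Sketch of variation-selector smuggling: each payload byte becomes an
# invisible Unicode variation selector appended to a carrier emoji
# (U+FE00-U+FE0F covers byte values 0-15, U+E0100-U+E01EF covers 16-255).
# A text filter sees a single emoji; a decoder recovers the instruction.

def smuggle(payload: str, carrier: str = "\U0001F600") -> str:
    out = carrier
    for b in payload.encode("utf-8"):
        out += chr(0xFE00 + b) if b < 16 else chr(0xE0100 + b - 16)
    return out

def recover(text: str) -> str:
    data = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0xFE00 <= cp <= 0xFE0F:
            data.append(cp - 0xFE00)
        elif 0xE0100 <= cp <= 0xE01EF:
            data.append(cp - 0xE0100 + 16)
    return data.decode("utf-8")
```

Nothing in the smuggled string is visible to a human reviewer, and a guardrail that classifies on surface text has nothing to classify — the instruction only exists after decoding.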

NATO phonetic alphabet encoding, base32, and Pig Latin bypassed Perplexity's BrowseSafe 36% of the time. These are not sophisticated attacks. Pig Latin. A children's word game defeated a production security system more than a third of the time.

OpenAI's own November 2025 assessment aligned with the NCSC: prompt injection may never be fully solved. Two organizations with every incentive to frame guardrails positively — a national security agency that needs workable defenses and the company whose business model depends on safe deployments — both concluded the same thing. The problem is architectural, not implementational.

[Image: a guardrail bypass escalating from content violation to code execution — dominoes falling through system layers]


When Failure Means More Than Embarrassment

The 100% emoji bypass rate matters more in 2026 than it would have in 2024 because the blast radius has changed.

Cursor, the AI-powered IDE, disclosed CVE-2025-54135 — a guardrail bypass that escalated to arbitrary code execution on the developer's machine. Not "generated a rude response." Code execution. The guardrail failure became an RCE because Cursor gave the model tool access: file writes, terminal commands, the full development environment.

The pattern repeats. Google's Gemini had the Trifecta chain: prompt injection to tool misuse to data exfiltration. Perplexity's Comet attack achieved the same escalation path. Slack AI's integration allowed context manipulation to leak private channel data.

Every major AI tool with tool access has had guardrail bypasses escalate beyond content policy violations into actual security incidents. The common thread isn't bad guardrails — some of these are state-of-the-art systems built by well-funded teams. The common thread is that a probabilistic filter was the last thing standing between an attacker and a privileged action.

MLCommons' AILuminate v0.5 benchmark found that 35 of 39 models tested degraded in safety performance under jailbreak conditions. That's 90% of production models becoming less safe when adversarial pressure is applied — exactly the condition that defines an attack scenario.

What Actually Works

The research doesn't argue for abandoning defenses. It argues for using the right kind.

Rule-based input validation. Deterministic checks that don't depend on model interpretation. Regex patterns for known encoding attacks. Character set restrictions. Length limits. Token-level analysis that operates on the input before it reaches any language model. These aren't glamorous, they don't have pitch decks, and they work predictably across every input they're designed to handle.
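A deterministic pre-filter of this kind is a few dozen lines, not a product. A minimal sketch — the specific checks and thresholds here are illustrative, not a complete defense:

```python
import re
import unicodedata

MAX_LEN = 4096
# Variation selectors and Unicode "tag" characters are common smuggling
# carriers; a deterministic check rejects them before any model sees the input.
SMUGGLING_RE = re.compile("[\uFE00-\uFE0F\U000E0000-\U000E01EF]")

def accept(prompt: str) -> bool:
    if len(prompt) > MAX_LEN:
        return False                      # length limit
    if SMUGGLING_RE.search(prompt):
        return False                      # known encoding carriers
    # Reject invisible format characters (zero-width space, bidi overrides, ...)
    if any(unicodedata.category(c) == "Cf" for c in prompt):
        return False
    return True
```

One honest caveat: legitimate emoji presentation uses U+FE0F, so a real deployment would tune these ranges for its traffic. The point is that whatever rules you choose, they behave identically on every input, forever — no 91%-in-the-lab, 33.8%-in-production cliff.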

Structured outputs. Constrain the model's response to a defined schema — JSON with typed fields, enum values, bounded ranges. A model that can only return {"action": "approve" | "deny", "reason": string} has a much smaller attack surface than one generating freeform text. The security boundary moves from "did the model say something bad" to "does this output conform to a schema," which is a deterministic check.
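With the approve/deny schema above, the security check becomes deterministic. A sketch, with field names mirroring the example in the text:

```python
import json

ALLOWED_ACTIONS = {"approve", "deny"}

def parse_decision(raw: str) -> dict:
    """Accept a model response only if it conforms exactly to the schema."""
    obj = json.loads(raw)                        # non-JSON fails here
    if set(obj.keys()) != {"action", "reason"}:
        raise ValueError("unexpected or missing fields")
    if obj["action"] not in ALLOWED_ACTIONS:
        raise ValueError("action outside enum")
    if not isinstance(obj["reason"], str):
        raise ValueError("reason must be a string")
    return obj
```

However the model is manipulated, the worst it can emit is a value outside the enum — which is rejected by a check that cannot be talked out of its decision.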

Privilege minimization. The Cursor RCE happened because the model had write access to the filesystem and terminal execution. Reduce the blast radius by reducing the permissions. An AI assistant that can read files but not write them, that can suggest commands but not execute them, converts a guardrail bypass from an RCE into a content policy violation — embarrassing, not dangerous.
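In code, the principle reduces to an allowlist: the names the model may emit map only to read-only operations, and everything else is refused before execution. A sketch with hypothetical tool names:

```python
import os

# Hypothetical read-only tool registry. Write and execute tools simply do
# not exist in the dispatch table, so a bypassed guardrail cannot escalate
# past reading: there is nothing to escalate into.
READ_ONLY_TOOLS = {
    "read_file": lambda path: open(path, encoding="utf-8").read(),
    "list_dir": lambda path: sorted(os.listdir(path)),
}

def dispatch(tool: str, arg: str) -> str:
    if tool not in READ_ONLY_TOOLS:
        raise PermissionError(f"tool {tool!r} is not permitted")
    return str(READ_ONLY_TOOLS[tool](arg))
```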

Human-in-the-loop for privileged actions. Any action with real-world consequences — code execution, data deletion, external API calls, financial transactions — gets a human confirmation step. The guardrail can still run. It just isn't the last line of defense.
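The gate itself is trivially simple, which is the point — a sketch with a hypothetical action list and confirmation callback:

```python
# Hypothetical set of actions with real-world consequences; anything in
# this set requires explicit human approval before it runs.
PRIVILEGED = {"execute_code", "delete_data", "external_api_call", "payment"}

def permit(action: str, confirm_fn) -> bool:
    """confirm_fn is whatever surfaces the request to a human
    (CLI prompt, Slack approval, ticket queue) and returns their answer."""
    if action not in PRIVILEGED:
        return True
    return bool(confirm_fn(action))
```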

Monitoring and anomaly detection. Treat the guardrail as a signal, not a gate. Log guardrail decisions, flag anomalies, alert on patterns. A guardrail that catches 88% of attacks is a terrible gate but a useful sensor in a system that has other defenses.
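The sensor framing can be sketched in a few lines: keep a sliding window of guardrail verdicts and flag when the recent block rate spikes, which usually signals an active probing campaign rather than ordinary traffic. The window size and threshold here are placeholder values:

```python
from collections import deque

class GuardrailSensor:
    """Treat guardrail verdicts as telemetry rather than a gate: record
    each decision and raise an alert when the block rate over a sliding
    window crosses a threshold."""

    def __init__(self, window: int = 100, alert_rate: float = 0.2):
        self.events = deque(maxlen=window)
        self.alert_rate = alert_rate

    def record(self, blocked: bool) -> bool:
        self.events.append(blocked)
        rate = sum(self.events) / len(self.events)
        return rate >= self.alert_rate   # True = page a human, tighten limits
```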

The NVIDIA benchmarks show guardrails add roughly 0.5 seconds of latency and reduce throughput by about 13%. That's a reasonable cost for a sensor that contributes to defense in depth. It's an unreasonable cost for a security boundary that misses 94% of attacks.

Honest Limitations

I've constructed this argument from publicly available research — peer-reviewed papers, preprints, vendor benchmarks, government assessments, and CVE disclosures. But the field moves faster than publication cycles. The 100% emoji bypass rate is from LLMSec 2025; vendors may have patched specific encoding vectors since then. The structural argument — probabilistic systems can't be deterministic boundaries — holds regardless of individual patches, but the specific numbers represent a snapshot, not a permanent state.

There's a harder limitation. I've argued for deterministic alternatives — rule-based validation, structured outputs, privilege minimization — as though they're straightforward to implement. They're not. Structured outputs constrain functionality. Privilege minimization limits capability. Human-in-the-loop slows throughput. Every alternative I've listed trades capability for safety, and the industry is sprinting in the opposite direction, adding tool access and autonomous execution as fast as the models will support it.

The real question isn't whether guardrails work. The data is clear on that. The question is whether the industry is willing to accept the capability constraints that actual security requires. The NCSC and OpenAI both said the quiet part out loud. Whether anyone adjusts their architecture in response is a different kind of problem — one that no benchmark can solve.
