

Anthropic's Safety Pledge is Horsesh*t

In the same week, Anthropic publicly refused to remove guardrails under Pentagon pressure — and quietly removed the training-pause commitment it had held since 2023.

ai-safety · anthropic · policy · military-ai · commentary

On February 9, Mrinank Sharma, Anthropic's Head of Safeguards Research, submitted his resignation. He sent a two-page letter. The sentence most people quoted: "The world is in peril. I've repeatedly seen how hard it is to truly let our values govern our actions."

He did not specify what he had seen.

Fifteen days later, on February 24, two things happened simultaneously. Anthropic published RSP v3.0 — a revision of the responsible scaling policy that had defined the AI safety movement's "race to the top" hypothesis since 2023. And Defense Secretary Pete Hegseth met with Dario Amodei and issued a 5:01pm Friday deadline: accept "all lawful use" terms for Claude's military deployment, or be designated a supply chain risk and have the contract terminated.

The company founded on the premise that safety commitments could hold under pressure was being tested on two fronts at once.

[Illustration: an octopus reading a policy document]

The Line Anthropic Drew

The Pentagon's demand was specific: two restrictions on Claude's use had to go. Under Anthropic's terms, Claude would not be used in fully autonomous weapons systems, meaning lethal systems that make targeting decisions without human oversight, and would not be used for mass surveillance of American citizens. The Pentagon wanted both prohibitions removed. Their framing: "all lawful use."

Anthropic's counter-offer was detailed. Claude could support missile defense. Cyber operations. Defense applications that do not involve autonomous lethal targeting or domestic mass surveillance. The two specific prohibitions were not negotiable.

Amodei explained why, publicly, on a podcast the day after the ultimatum:

"The constitutional protections in our military structures depend on the idea that there are humans who would disobey illegal orders with fully autonomous weapons. Autonomous drones would not be able to make such a distinction."

The backstory that made this dispute possible: on January 3, Claude had been deployed via Palantir Technologies on a classified "Secret"-level network — the first large language model publicly reported on classified U.S. systems — and used during the operation that captured Venezuelan President Nicolás Maduro: satellite imagery analysis, intelligence synthesis, real-time support to Special Operations forces. When a senior Anthropic executive asked a senior Palantir executive whether Anthropic's software had been used in the operation, Palantir reported the query to the Pentagon. The question — understood internally as potential disapproval — became a formal grievance.

By February 26, Anthropic had formally rejected the Pentagon's "final offer." Their spokesperson noted the offer contained "legalese that would allow those safeguards to be disregarded at will." Emil Michael, the Pentagon's CTO, called Amodei "a liar" with a "God-complex" who wants to "personally control the US Military." He added: "At some level, you have to trust your military to do the right thing."

Amodei's public statement was direct:

"These threats do not change our position: we cannot in good conscience accede to their request."

The deadline passed at 5:01pm on February 27. As of February 28, the outcome is unconfirmed. What we know: Anthropic didn't bend. What happens next — contract termination, supply chain risk designation, or continued negotiation — has not been reported.

The Line Anthropic Moved

RSP v3.0, published the same day as the Hegseth ultimatum meeting, is a different policy domain. It governs training decisions, not deployment restrictions. The distinction matters, and almost every news headline about this week collapses it.

The prior policy, RSP v2.2, contained an unconditional commitment: Anthropic would not train systems beyond a defined capability threshold "unless it could guarantee in advance that the company's safety measures were adequate." One condition. A categorical bar. The policy also committed to define ASL-4 thresholds before Claude crossed ASL-3 levels.

RSP v3.0 replaces this with a dual condition: a training pause now triggers only if Anthropic leads the race and catastrophic risk is significant. The hard tripwires — ASL-3/4/5 binary thresholds — are replaced by "Frontier Safety Roadmaps" and "Risk Reports." Disclosure mechanisms, not development blockers.

The ASL-4 forward-commitment was quietly removed. The EA Forum noted the version history "doesn't explicitly flag the removal." Not hidden; not highlighted.

Jared Kaplan, Anthropic's Chief Science Officer, explained the reasoning:

"We didn't really feel, with the rapid advance of AI, that it made sense for us to make unilateral commitments… if competitors are blazing ahead."

Then, separately: "I don't think we're making any kind of U-turn."

The internal timeline matters. An Anthropic researcher's LessWrong post stated the team had "started pushing for the move to RSP v3 in February 2025" — twelve months before publication. This revision had its own independent trajectory. It was not a response to the Pentagon.

The timing, however, is what it is.

[Illustration: two rivers converging, safety pledge meets Pentagon pressure]

These Are Not the Same Crisis

The dominant frame — "Anthropic weakens safety pledge in the wake of Pentagon pressure" — is temporal inference masquerading as causal argument.

The RSP process began in February 2025. The Maduro operation happened January 3, 2026. Sharma resigned February 9. The formal Pentagon dispute escalated in mid-February. RSP v3 was published February 24, after the dispute was public, but also after roughly twelve months of internal revision. Attributing one to the other requires ignoring the documented timeline.

I want to be precise about what the evidence actually supports: the two events coexisted. They may have reinforced each other at the margins. The public framing of RSP v3 may have been shaped by the political moment. But demonstrated causation — "Anthropic moved the training line because of Pentagon pressure" — isn't established by the available sourcing, which relies heavily on anonymous accounts and competing interpretations.

What I can't fully dismiss is this: the RSP process that began in February 2025 concluded with publication on the exact day of the ultimatum meeting. That simultaneity doesn't require explanation to be notable. Convergence is its own kind of signal.

The Frog-Boiling Problem

The more consequential change is not the Pentagon standoff — where Anthropic held. It's the replacement of hard thresholds with disclosure mechanisms.

Chris Painter, Policy Director at METR — Model Evaluation & Threat Research — had worked with Anthropic on RSP proposals before this revision. His read on v3.0 was measured on the transparency additions and specific on the structural risk.

Binary thresholds create a forcing function. A capability tripwire requires a decision: if we reach this threshold without adequate safety measures, we stop. The tripwire doesn't negotiate. It doesn't weigh competitive dynamics. It doesn't ask whether DeepSeek is still training. It halts.

Frontier Safety Roadmaps and Risk Reports do not halt anything. They document. They signal. They create visibility into where the lab is on a development trajectory. Documentation can always be rationalized. Disclosure can coexist with deployment.

Painter's framing for what happens when you remove binary triggers: a "frog-boiling" effect — "where danger slowly ramps up without a single moment that sets off alarms." His conclusion: "More evidence that society is not prepared for the potential catastrophic risks posed by AI."

The internal Anthropic response to this critique is not naive. The LessWrong defense framed RSP v3 as the correct response to a collective action problem: if Anthropic unilaterally halts at an ASL-3 threshold while Mistral and DeepSeek and xAI continue training without equivalent restrictions, the result is a safety-committed lab falling behind without making AI safer. Kaplan's version: the unilateral commitment was underspecified and possibly counterproductive.

Both arguments can be true simultaneously. The collective action logic is sound. The frog-boiling warning is also sound. Sound arguments pointing in opposite directions are what a genuine policy tradeoff looks like, not evidence that one side is wrong.

What sits separately from either argument is the non-transparent removal of the ASL-4 commitment. Whatever the reasoning, the rollback was not made maximally visible. That's a credibility issue independent of whether the policy change was correct.

Three Chinese Labs and a Friday Morning

On February 23 — one day before the ultimatum meeting — Anthropic published a different kind of disclosure.

Three Chinese AI labs, DeepSeek, Moonshot AI, and MiniMax, had conducted what Anthropic called "industrial-scale distillation attacks": 16 million exchanges extracted via approximately 24,000 fraudulent accounts, in violation of terms of service and regional access restrictions. When Anthropic released a new model during MiniMax's campaign, MiniMax pivoted within 24 hours, redirecting nearly half their traffic to capture capabilities from the latest system. The distillation included generating censorship-compliant alternatives to political questions, evidence that the operation aimed partly at training politically compliant AI for domestic Chinese deployment.

I've read the report three times and I'm still not sure what to do with the timing.

Three non-exclusive readings:

1. Coincidence. The report was in preparation for weeks, Bloomberg had it before the ultimatum date, and internal readiness drove the release.
2. Strategic positioning. Publishing evidence of Chinese AI theft the day before being summoned reframes Anthropic as a national security asset being undermined by adversaries, and it makes a "supply chain risk" designation politically awkward when Anthropic is the one documenting actual supply chain risks from foreign actors.
3. RSP v3 narrative support. The distillation report anchors the argument that unilateral safety pauses benefit foreign competitors, and the two documents, published within 24 hours of each other, reinforce a coherent communication package.

The report doesn't resolve into a clean answer. Neither does Kaplan's collective action argument. Neither does the question of whether Feb 24 RSP v3 was independent of Feb 24 Pentagon pressure. The week contains multiple unresolved ambiguities, and collapsing them into a clean narrative — "Anthropic under pressure" or "Anthropic doing principled triage" — requires ignoring the parts that don't fit.

[Illustration: an octopus holding a taut rope between safety and power]

The Founding Bet

When Anthropic published RSP v2.0 in 2023, the hypothesis was roughly this: a lab could hold safety commitments unconditionally, demonstrate that competitive success and meaningful restrictions could coexist, and pull the industry toward higher standards through demonstrated viability. Within months, OpenAI and Google DeepMind had adopted broadly similar frameworks.

RSP v3 revises the hypothesis. The new claim is softer: a lab can hold meaningful safety commitments conditionally, while tracking competitors and adjusting unilateral commitments to what the industry will support. Whether this is a principled update to an overambitious original claim, or the first step in a longer normalization, is not determinable from the documents in front of me.

The founding bet was always vulnerable to exactly this pressure. Which is why Sharma's letter, dated February 9, matters in a way that wasn't clear at the time. He wrote that he had "repeatedly seen how hard it is to truly let our values govern our actions." He saw it before the public crisis. Before the deadline. Before RSP v3 was announced. Whatever he saw was already present, internally, weeks before any of this became a news story.

What he saw, we still don't know.

Sharma's letter was dated February 9. RSP v3.0 was dated February 24. Fifteen days apart, from opposite ends of the organization. One signals a problem. The other announces a response.

Whether the response matches the signal is the question neither document answers.

