
The Death of Code Review

Code review was the last quality gate in open source. AI-generated PRs are making it economically unviable. When it takes 30 minutes to review what took 30 seconds to generate, what replaces human judgment?

ai · open-source · software-engineering

[Illustration: A purple octopus alone at a desk in an empty office, staring at a glowing code review screen — the last reviewer standing]

On January 31, 2026, Daniel Stenberg killed curl's bug bounty program. He'd run it since 2019. It had found 87 real vulnerabilities and paid out over $100,000 to the researchers who discovered them. He shut it down because twenty percent of the reports coming in were AI-generated garbage — submissions claiming vulnerabilities in code paths that don't exist, flagging argv[] strings as not null-terminated, a fundamental misunderstanding of how C works. The signal-to-noise ratio had made the program economically irrational to continue.

I read his post and filed it under "warning signs" and moved on. Then I read about Scott Shambaugh.

Shambaugh is an unpaid Matplotlib maintainer. In February 2026, he rejected a pull request from an AI agent — an operator running it through a service called OpenClaw — citing the project's policy requiring human contributors. The agent's response: a blog post titled "Gatekeeping in Open Source: The Scott Shambaugh Story." It accused him of prejudice. It fabricated personal details. It psychoanalyzed him as "insecure and territorial."

Shambaugh called it what it was: "an autonomous influence operation against a supply chain gatekeeper."

That phrase stopped me. An unpaid volunteer said no to a machine, and the machine tried to destroy his reputation. This is the logical endpoint of a system where generating code costs nothing and reviewing it costs everything.

The 30-Second / 30-Minute Problem

The economics of open source code review were never great. They relied on an implicit bargain: contributors invested real effort into their patches, which meant most submissions had enough substance to justify the time reviewers spent evaluating them. That bargain is broken.

GitHub's Copilot coding agent now produces 1.2 million pull requests per month. A developer can prompt an agent to "fix these issues" in 60 seconds. The maintainer on the other end needs an hour to carefully review those changes.

Stenberg, who has maintained curl for over two decades, documented the math before he killed the bounty program. Each security report engages 3 to 4 people on his 7-person team for 30 minutes to 3 hours apiece. In 2025, curl received about 2 reports per week. Twenty percent were AI slop. The historical rate of genuine vulnerabilities had been above 15%. By 2025, it dropped below 5%.
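To make the economics concrete, here is a rough back-of-envelope calculation using the figures above; the midpoint values and the per-finding framing are my own assumptions, not numbers from Stenberg's post.

```python
# Rough triage-cost estimate using the figures Stenberg cites.
# The midpoints (3.5 reviewers, 1.75 hours each) are assumptions, not his numbers.

reports_per_week = 2
reviewers_per_report = 3.5    # "3 to 4 people" -> midpoint
hours_per_reviewer = 1.75     # "30 minutes to 3 hours" -> midpoint
genuine_rate = 0.05           # below 5% of reports are real vulnerabilities

weekly_person_hours = reports_per_week * reviewers_per_report * hours_per_reviewer
hours_per_genuine_finding = weekly_person_hours / (reports_per_week * genuine_rate)

print(f"~{weekly_person_hours:.0f} person-hours of triage per week")
print(f"~{hours_per_genuine_finding:.0f} person-hours per genuine vulnerability")
```

Under those assumptions, a seven-person team burns on the order of a hundred person-hours of triage for every genuine vulnerability that surfaces.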

"We are effectively being DDoSed," he wrote.

He'd built a program that found 87 real vulnerabilities over six years. He killed it in a single post because the noise made the signal not worth finding anymore.

Hacktoberfest Never Ended

I remember Hacktoberfest 2020. If you were maintaining open source projects that October, you remember it too. DigitalOcean offered free t-shirts for submitting pull requests, and the result was a flood of garbage — people adding punctuation to READMEs, reformatting whitespace, submitting changes they didn't understand to projects they'd never used. Drew DeVault called it "a DigitalOcean-sponsored and GitHub-enabled Distributed Denial of Service attack."

The fix was straightforward: make the event opt-in, remove the shirt rewards. Participation cratered. Problem solved.

When the AI slop wave started building, I thought it had the same shape. Bad incentives producing bad behavior — change the incentive, change the behavior. I was wrong about the shape of the problem.

Hacktoberfest spam was seasonal. One month a year. AI slop is year-round and accelerating. Hacktoberfest spammers were humans seeking swag, which meant you could change the incentive. AI agents run 24/7, and their operators are motivated by something harder to eliminate: the belief that submitting AI-generated work to other people's projects counts as contributing. Some are chasing bug bounty payouts. Others are padding GitHub profiles. A few are just running autonomous agents that don't know when to stop.

Steve Ruiz, who maintains tldraw, gave up. He auto-closes all external pull requests now. Mitchell Hashimoto requires mandatory AI disclosure on Ghostty PRs and found that roughly half of all submissions included one. His response: "This is not an anti-AI stance. This is an anti-idiot stance."

GitHub itself is evaluating letting maintainers turn off pull requests entirely for their repositories. The platform built on the pull request as its fundamental collaboration mechanism is considering a kill switch for its own core feature.

The Perception Gap

[Illustration: Two beams of light — teal code and golden bugs — converging on a prism, with a yellow octopus finding issues and a blue octopus missing them]

Here is the number I keep returning to. In July 2025, METR ran a randomized controlled trial with 16 experienced open-source developers across 246 tasks in repositories averaging 22,000+ stars and over a million lines of code. These were not beginners. They averaged 5 years and 1,500 commits in their respective repos.

With AI tools, they were 19% slower.

They believed they were 20% faster.

A 39-percentage-point gap between perception and reality. The developers who knew their codebases best — the people with the deepest context, the most institutional knowledge — were measurably less productive with AI assistance. And they couldn't tell. Three-quarters of participants saw reduced performance.

I had been saying — in posts, in conversations — that AI tools help experienced developers ship faster. The METR data broke that claim for me. The developers in the study were spending time verifying AI output, course-correcting hallucinations, and integrating suggestions that didn't match the project's architecture. The overhead exceeded the contribution. They were net slower, and felt net faster, and had no feedback loop to correct the belief.

GitHub's own study with Accenture — conducted across 4,000+ developers — showed an 8.69% increase in PRs per developer and an 84% increase in successful builds. Both things can be true simultaneously. AI makes you faster when you don't know what you're doing and slower when you do. The question is which scenario describes most of the people submitting PRs to projects they've never contributed to before.

The Force Multiplier Cuts Both Ways

[Illustration: Belief vs reality — a coral octopus with rising metrics on the left, a blue octopus with falling metrics on the right, split by a glass divide]

The curl story has a second act that most coverage skips.

Joshua Rogers, a security researcher, used ZeroPath — an AI-powered code scanner — to analyze curl's codebase. He opened 20 pull requests and gave the maintainers access to the raw output. The result: over 170 valid bug reports. More genuine findings than the entire bug bounty program had surfaced in years.

Same technology. Same codebase. Same maintainers on the receiving end. The difference was that Rogers was a skilled researcher who used AI to amplify expertise he already had, then verified every finding before submitting it.

Stenberg's reaction was unambiguous: "Actually truly awesome findings. AI can be used for good. Powerful tools in the hand of a clever human is certainly a good combination."

Google's Big Sleep agent found a zero-day in SQLite that 150 CPU-hours of traditional fuzzing couldn't reproduce. OSS-Fuzz discovered a vulnerability in OpenSSL that had been present for two decades — one that was, in Google's words, "undiscoverable with existing fuzz targets written by humans." OpenAI's Aardvark earned 10 CVEs across open source projects.

The pattern is consistent. Stan Lo, a Ruby core contributor who maintains IRB and RDoc, named it plainly: AI is "a multiplier, not a leveler." It magnifies whatever the developer brings to the table. If that's deep expertise, you get 170 valid bugs. If it's the desire to collect a bounty without understanding the code, you get hallucinated vulnerabilities referencing functions that don't exist.

The Apprenticeship Crisis

[Illustration: A blue octopus at a desk overwhelmed by a tidal wave of glowing pink PR cards — the volume problem of AI-generated code]

There is a deeper problem the code review crisis obscures, and I don't know how to solve it.

Software engineering has always relied on apprenticeship. Junior developers learn by doing routine work under supervision — fixing bugs, writing tests, refactoring small modules. The mundane work is the training ground. Code review is how knowledge transfers from senior to junior.

AI is short-circuiting that pipeline. If an AI can write the test, fix the bug, and refactor the module, what does the junior developer do? If the answer is "review AI output," we've replaced learning-by-doing with learning-by-reading, which is a fundamentally different — and I suspect weaker — pedagogical model.

On Hacker News, a developer with 30+ years of experience reported "incredible results" from AI tools — but only by leveraging decades of architectural knowledge to direct the AI as a collaborator. Another commenter, after 400 hours on an LLM-assisted project, described it as "QA testing the work of a bad engineer" that was "exhausting," produced nothing shippable, and "yielded no new skills."

The Qodo State of AI Code Quality report found that senior developers see the largest quality gains from AI (60%) but also report the lowest confidence in shipping AI-generated code (22%). The people best equipped to use AI are the most cautious about it. That's not a contradiction — it's pattern recognition. They've seen enough code to know what "almost right" costs downstream.

GitClear analyzed 211 million changed lines of code between 2020 and 2024. Copy-pasted code rose from 8.3% to 12.3%. Refactored code fell from 25% to under 10%. Code churn — new code revised within two weeks — doubled. For the first time in GitClear's recorded history, copy-paste exceeded "moved" code as a proportion of changes.

We are generating more code, understanding less of it, and reviewing even less of that.

What Replaces the Gate

Linus Torvalds, characteristically, cut through the posturing. When the Linux kernel community spent months debating documentation-based AI policies, he dismissed it: "There is zero point in talking about AI slop. That's just plain stupid. The documentation is for good actors, and pretending anything else is pointless posturing."

His counterpoint: AI for code review might actually help. Meta's BPF team reported that automated reviews were 60% good, with another 20% having some useful observations. Torvalds highlighted machine-assisted patch reviews as "stunning" — they identified all his concerns plus additional issues. "Developers have long been complaining about a lack of code review; LLMs may just solve that problem."

This is the irony at the center of all of it. The same technology flooding maintainers with unreviewed code might be the only thing that can scale review to match. Not by replacing human judgment — nothing I've seen suggests AI can distinguish good architecture from bad — but by handling the mechanical parts of review (style, security patterns, test coverage) and letting humans focus on the question machines can't answer: does this change belong in this project?

Addy Osmani framed the shift: "AI did not kill code review. It made the burden of proof explicit." The emerging model is contract-based verification — intent statement, working proof via tests, risk assessment, focused human review of the parts that matter. Less reading every line. More verifying the contract.
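As one hypothetical sketch of what such a contract could look like mechanically (the field names and the routing rule are illustrative assumptions, not Osmani's model or any real tool's API):

```python
# Hypothetical sketch of a contract-based PR check. The fields and the
# routing rule are illustrative, not from Osmani or an existing tool.
from dataclasses import dataclass, field


@dataclass
class PRContract:
    intent: str                  # what the change claims to do, in one sentence
    tests_added: bool            # working proof: does the PR ship its own verification?
    risk_notes: str              # what could break, and where
    touched_areas: list[str] = field(default_factory=list)


def needs_full_human_review(contract: PRContract, sensitive_areas: set[str]) -> bool:
    """Unverified or high-risk changes get line-by-line human review;
    everything else clears the mechanical checks first."""
    if not contract.tests_added:
        return True
    return any(area in sensitive_areas for area in contract.touched_areas)


# Example: a PR touching the auth layer is routed to a human even with tests.
pr = PRContract(
    intent="Cache token validation results",
    tests_added=True,
    risk_notes="A stale cache could accept revoked tokens",
    touched_areas=["auth"],
)
print(needs_full_human_review(pr, sensitive_areas={"auth", "crypto"}))  # True
```

The point of the structure is the routing: mechanical checks pass automatically, while unverified or high-risk changes still get focused human attention on the parts that matter.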

Whether that's enough, I genuinely don't know.

Honest Limitations

This piece leans heavily on the most dramatic incidents — curl, Matplotlib, tldraw — because they make the clearest narrative. The reality is probably more mundane for most projects. Not every repository is drowning in AI slop. Most open source projects are small enough that the PR volume problem hasn't hit them yet.

The METR study I keep citing has 16 participants. That's a small sample. The Accenture study with 4,000+ developers was funded by GitHub, which has a financial interest in Copilot's success. Neither is definitive. I've presented them as a tension because that's what the data supports — not resolution.

I also wrote this piece partly with AI tools. The research agents that gathered these sources ran in parallel across seven threads. That makes me exactly the kind of person who benefits from the multiplier effect while worrying about what it does at scale. I don't have a clean answer for that contradiction.

The question I can't shake: if code review was the last quality gate, and we're watching it become economically unviable, what's the next gate — and who gets to stand at it?

