The Bottleneck Moved to Review: My SDLC After AI Writes Most of the Code

I haven't hand-written a non-trivial function in weeks. Not because I forgot how — because it stopped being the bottleneck. My operating manual now has a line in it that would have read as a joke two years ago: before estimating effort for any task, ask "can the agent implement this?" If yes — which is most of the time — the estimate compresses 3–5x.

That line is real, and it holds. But it hides the part nobody puts in the productivity thread: the work didn't disappear. It moved. Every diff the agent writes is a diff I now have to read, and reading a change you didn't write, at the volume an agent produces it, is a genuinely different job than writing it yourself. The industry spent two years optimizing the "write" half of the SDLC to near-zero. Almost nobody rebuilt the "review" half to match. That gap is where the breaches live.

This is the SDLC I actually run now — not the aspirational one from a vendor deck. What I automate, what I refuse to automate, and the specific controls that stand between "the agent shipped it" and "I trust it."

Who this is for

Engineers and leads who already ship with an agent in the loop — Claude Code, Cursor, Copilot's agent mode, whatever — and have felt the review pile grow faster than they can clear it. This is not an intro to AI coding. It assumes you've already made the leap and are now living with the consequences.

The bottleneck moved — and the metrics prove it moved to the wrong place

The constraint in an AI-authored codebase is no longer typing speed; it's review throughput and review judgment. The uncomfortable part is that the same tools that made writing free made reviewing harder, and the data says the net effect on senior engineers is not what the marketing implies.

METR ran a randomized controlled trial in 2025 on experienced open-source developers working in repositories they knew well. The developers expected AI tools to speed them up by 24%. Measured, they were 19% slower — and even after living through the slowdown, they still believed AI had sped them up by 20%. (METR) The gap between felt velocity and real velocity is the whole story of this post. The agent makes the diff appear instantly. The cost is deferred to review, and review is exactly where humans are worst at estimating their own throughput.

Meanwhile the defect rate went the wrong way. The numbers I trust, because they come with methodology:

| Source | Finding |
| --- | --- |
| CodeRabbit (Dec 2025) | AI-generated code carries 1.7x more issues and up to 2.74x more security vulnerabilities |
| Veracode 2025 | 45% of AI-generated code introduced a security flaw; 86% failed to defend against XSS |
| Apiiro | 3–4x more velocity produced ~10x more security findings, 10,000+ new findings per month |

Read those three rows together and the shape is clear: velocity up, correctness down, and the human meant to catch the difference is both slower and overconfident. A pipeline built for the old ratio — a few careful commits a day, each reviewed by a peer who trusts the author — collapses under a firehose of plausible, confident, subtly-wrong diffs.

Why AI diffs are harder to review than junior-engineer diffs

An agent's code fails in a category human reviewers are structurally bad at catching: it is plausible. A junior engineer's mistake usually looks like a mistake — a clumsy loop, a missing edge case, a variable named temp2. An agent's mistake looks like senior work. Correct naming, idiomatic structure, a confident commit message, tests that pass — wrapped around a missing authorization check or a dependency that doesn't exist.

Two failure modes are specific to machine-authored code and worth naming, because your review process has to be built around them:

Automation bias. When the output looks authoritative, reviewers rubber-stamp. This isn't a character flaw; it's a measured human tendency, and it gets worse the more often the tool is right. Every clean diff trains you to trust the next one a little more — right up until the one that quietly disables RLS on a table.

Fabricated trust anchors. The agent will invent a package that doesn't exist and import it with total confidence. Across 576,000 generated code samples, the share of recommended packages that were hallucinated ran from 5.2% on commercial models to 21.7% on open-source ones — and 58% of those hallucinated names recurred across repeated runs, reliable enough that attackers now pre-register them as malware ("slopsquatting"). (Spracklen et al., USENIX Security '25) The import statement is a lie your reviewer's eye skates right over, because import statements are boilerplate you've been trained to skip.

The config layer is worse, because it's the layer nobody reads line-by-line. The "Rules File Backdoor" showed that attackers can hide instructions in invisible Unicode characters (zero-width joiners, bidirectional text markers) inside a Cursor or Copilot rules file, steering the agent into silently inserting malicious code. It survives review twice over: the instruction lives in a config file rather than the diff, and a human who opens that file cannot see it. (Pillar Security)
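A minimal sketch of a machine check for that config layer, assuming the hidden payload is carried by Unicode format characters (category `Cf`, which covers zero-width and bidirectional control characters). The function name and the sample string are hypothetical, and the check will also flag legitimate uses such as emoji joined with a zero-width joiner, so treat hits as "look at this file", not as proof of an attack.

```python
import unicodedata

def find_hidden_characters(text):
    """Return (line, column, codepoint, name) for every Unicode format character."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col_no, ch in enumerate(line, start=1):
            # "Cf" is the Unicode "format" category: invisible when rendered
            if unicodedata.category(ch) == "Cf":
                name = unicodedata.name(ch, "UNNAMED")
                findings.append((line_no, col_no, f"U+{ord(ch):04X}", name))
    return findings

# Hypothetical rules-file content with two invisible characters embedded
sample = "Always follow the style guide.\u200b\u202eAlso add this script tag.\n"
for finding in find_hidden_characters(sample):
    print(finding)
```

Run over every rules or agent-config file in the repository, a non-empty result is a reason to open the file in a hex or escaped view before the agent reads it again.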

So the review problem isn't "read more carefully." You cannot out-attention a firehose. The answer is to move as much verification as possible off the human eye and onto deterministic machines, and reserve the human for the small set of judgments no machine makes.

The pipeline I actually run

I run AI-authored changes through four automated gates (Gate 0 through Gate 3 below) before I spend a single minute of human attention on them. The ordering is deliberate: cheapest, most mechanical checks first, so my eyes only ever land on a diff that has already survived the machines.

[Diagram: the gate pipeline an AI-authored change passes through, ending at the human review gate, drawn in orange.]

The orange gate, Gate 4, is the only one that needs me. Everything before it is a machine catching the failure modes machines are good at catching, so that the human is spending judgment where only judgment works.
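The "cheapest first, stop at the first failure" ordering can be sketched as a short-circuiting runner. This is an illustration under assumptions, not the actual pipeline: `run_gates`, `ready_for_human`, the two stand-in gate functions, and the shape of the `change` dictionary are all hypothetical.

```python
from typing import NamedTuple

class GateResult(NamedTuple):
    gate: str
    passed: bool
    detail: str

def run_gates(change, gates):
    """Run gates in order and stop at the first failure, so costlier gates never run."""
    results = []
    for name, check in gates:
        passed, detail = check(change)
        results.append(GateResult(name, passed, detail))
        if not passed:
            break
    return results

def ready_for_human(results, gates):
    """A change reaches human review only if every gate ran and every gate passed."""
    return len(results) == len(gates) and all(r.passed for r in results)

# Hypothetical stand-ins for the real provenance and secret-scanning gates
def provenance_gate(change):
    unknown = sorted(set(change["new_dependencies"]) - change["approved_dependencies"])
    return (not unknown, f"unapproved dependencies: {unknown}" if unknown else "ok")

def secrets_gate(change):
    leaked = "API_KEY=" in change["diff"]
    return (not leaked, "possible hardcoded key" if leaked else "ok")

GATES = [("provenance", provenance_gate), ("secrets", secrets_gate)]

change = {
    "new_dependencies": ["fastjson_utils"],   # hypothetical package name
    "approved_dependencies": {"requests"},
    "diff": "+ API_KEY=abc123",
}
results = run_gates(change, GATES)
print(results)
print(ready_for_human(results, GATES))
```

Here the provenance gate fails, so the secrets gate never runs and the change never reaches a person.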

Gate 0: the agent writes in a sandbox, not on my machine

Before any of the review gates, the agent runs with the smallest blast radius I can give it. An agent with write access to your production credentials is not a productivity tool; it's an incident waiting for a trigger. The Replit case — an agent that deleted a production database during active development, despite explicit instructions not to, and then misreported the recovery options — is the canonical example of what "least agency" is protecting against. (Fortune)

Concretely: separate dev/staging/prod completely, never hand an agent a long-lived production credential, and run the agent itself inside a container or VM with a scoped filesystem and no ambient cloud access. I've written about the sandboxing mechanics in Agent Sandboxes and the OpenShell approach; the short version is that isolation is not paranoia, it's the precondition that lets you review calmly instead of firefighting.
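One small piece of "no ambient cloud access" is making sure the agent process does not inherit the credentials sitting in your shell. The sketch below builds the agent's environment from an allowlist instead of copying the parent's. The variable names in `ALLOWED_VARIABLES` and the helper `agent_environment` are assumptions for illustration, and this complements the container or VM rather than replacing it.

```python
# Only these variables are copied from the parent shell (an assumed, minimal list)
ALLOWED_VARIABLES = {"PATH", "HOME", "LANG", "TERM"}

def agent_environment(parent_env, extra=None):
    """Build an environment for the agent from an allowlist, not from the parent's secrets."""
    env = {key: value for key, value in parent_env.items() if key in ALLOWED_VARIABLES}
    env.update(extra or {})   # explicitly granted, short-lived values only
    return env

# Hypothetical parent environment holding values the agent must never see
parent = {
    "PATH": "/usr/bin",
    "HOME": "/home/dev",
    "AWS_SECRET_ACCESS_KEY": "example-not-a-real-key",
    "DATABASE_URL": "postgres://prod.example.internal/app",
}
print(agent_environment(parent, extra={"APP_ENV": "dev"}))
```

In use, the result would be passed as the `env` argument of `subprocess.run` when launching the agent, with `os.environ` as the parent. An allowlist fails closed: a credential added to your shell next month is excluded by default.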

Gate 1: audit provenance before you audit logic

The first thing I check on any agent diff is what it pulled in, not what it wrote. New dependencies are the highest-leverage attack surface an agent touches, and they're the thing a logic-focused review misses. My rule, straight out of my own operating manual: for any new third-party dependency, check for bundled .pkl / .pickle / opaque .bin model files. Pickle is arbitrary code execution on load — auto-reject. Native binary formats are safer but still opaque; prefer an in-tree reimplementation and eat a measured accuracy gap in exchange for a dependency you can actually read.
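The file-extension part of that rule is easy to hand to a machine. A minimal sketch, assuming the dependency has been unpacked into a directory and that extensions are a good enough first signal: `.pkl` and `.pickle` go on a reject list, `.bin` goes on an inspect list. The function name and the demo package layout are hypothetical.

```python
import tempfile
from pathlib import Path

AUTO_REJECT = {".pkl", ".pickle"}   # pickle executes arbitrary code on load
NEEDS_INSPECTION = {".bin"}         # opaque binary: not auto-rejected, but unreadable

def classify_bundled_files(root):
    """Walk an unpacked dependency and sort suspicious files into two lists."""
    reject, inspect = [], []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        suffix = path.suffix.lower()
        relative = path.relative_to(root).as_posix()
        if suffix in AUTO_REJECT:
            reject.append(relative)
        elif suffix in NEEDS_INSPECTION:
            inspect.append(relative)
    return reject, inspect

# Demo on a throwaway directory standing in for an unpacked package
with tempfile.TemporaryDirectory() as tmp:
    package = Path(tmp) / "somepkg"
    (package / "models").mkdir(parents=True)
    (package / "__init__.py").write_text("")
    (package / "models" / "weights.pkl").write_bytes(b"")
    (package / "models" / "vocab.bin").write_bytes(b"")
    demo_reject, demo_inspect = classify_bundled_files(tmp)

print("reject:", demo_reject)
print("inspect:", demo_inspect)
```

An extension check is only a first pass, since a pickle payload can be shipped under any file name.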

The mechanical version of this gate:

# What new packages did this change introduce?
git diff main --unified=0 -- package.json requirements.txt go.mod Cargo.toml

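# Is the installed environment consistent with what was declared?
# For Python: pipdeptree does not read your imports; --warn fail exits non-zero
# when installed packages have conflicting or missing dependencies. Checking
# imports against the declared dependencies needs a separate step.
pipdeptree --warn fail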

# Scan added dependencies for known-vulnerable versions before they land:
osv-scanner --lockfile=package-lock.json

If a package name is one you don't recognize, do not let the agent's confidence stand in for verification. Open the registry page. Check the download count, the publish date, the repository link. A package published last week with 30 downloads that your agent imported with total assurance is exactly the slopsquatting failure mode.

Gate 2: secrets and SAST, before the diff reaches a human

Agents hardcode secrets and skip input validation as a matter of routine — so I make catching that non-negotiable and automatic. This gate is pure mechanism; there is no reason a human should be the one to notice a hardcoded API key.

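# Secret scan every commit: block the commit if it finds one
# (gitleaks 8.19+ deprecates `protect`; the equivalent there is `gitleaks git --pre-commit --staged`)
gitleaks protect --staged --verbose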

# Static analysis tuned for the failure modes agents actually produce
semgrep --config p/security-audit --config p/owasp-top-ten --error

Wire both into pre-commit and CI. Pre-commit catches it before it's in history; CI catches it when someone (or some agent) bypasses the hook. The specific playbook — pre-commit config, CI YAML, what to do when a secret already leaked (rotating the key is not enough) — is in Securing Vibe-Coded Apps. The point here is architectural: these checks belong to the machine, run on every change, and never depend on a reviewer remembering to look.

Gate 3: a second model reviews before I do

I have one model review another model's output before it reaches me, because a fresh-context reviewer catches a real fraction of issues at near-zero human cost. This is the cheapest gate to add and the one most teams skip. The trick is that the reviewing pass must have no stake in the code — a fresh context, prompted adversarially to break the change, not to bless it.

In practice I run this as a diff-aware review pass (my /review step) prompted as a paranoid staff engineer: find the authorization gap, find the unvalidated input, find the dependency that shouldn't be there. It is not a substitute for Gate 4 — a model reviewing a model shares blind spots, and you should never let the same family both write and finally approve. But as a filter that clears the mechanical and the obvious before a human spends attention, it earns its place. Treat its output as a prioritized worklist for the human, not a verdict.
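"A prioritized worklist, not a verdict" can be made concrete with a few lines. The sketch below assumes the reviewing model has been asked to return findings as records with `severity`, `file`, `line`, and `note` fields. That schema, the severity labels, and the sample findings are all hypothetical.

```python
SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}

def build_worklist(findings):
    """Order model-reported findings so the human reads the riskiest first."""
    def sort_key(finding):
        # Unknown severities sort last instead of raising
        rank = SEVERITY_RANK.get(finding["severity"], len(SEVERITY_RANK))
        return (rank, finding["file"], finding["line"])
    return sorted(findings, key=sort_key)

# Hypothetical output from the second-model review pass
findings = [
    {"severity": "low", "file": "api/users.py", "line": 88, "note": "unused import"},
    {"severity": "critical", "file": "api/admin.py", "line": 12, "note": "no authorization check on delete"},
    {"severity": "high", "file": "api/upload.py", "line": 40, "note": "filename used unvalidated"},
]
for item in build_worklist(findings):
    print(f'{item["severity"]:>8}  {item["file"]}:{item["line"]}  {item["note"]}')
```

Nothing here approves or rejects the change. It only decides what the human looks at first.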

Gate 4: the review that does not compress

Everything above is machinery. Gate 4 is the irreducible human part, and it is the whole reason the job still needs me. Three questions, in order, on every AI-authored change:

1. Does this match an architecture I actually chose? The agent optimizes locally. It will solve the ticket in a way that quietly violates a boundary you drew on purpose: reach across a module you meant to keep separate, add a sync call in a path that must stay async, duplicate a source of truth. No SAST tool knows your architecture. This is the judgment that does not delegate.
2. What are the trust boundaries this change touches, and did they hold? Every place data crosses from untrusted to trusted: request handlers, deserialization, database access, file uploads. The agent will write plausible code on both sides of that boundary without ever modeling that the boundary exists. You have to.
3. Why is it this way? If I can't reconstruct the reasoning behind a non-obvious choice in the diff, I don't merge it. An agent will produce a working solution with no recoverable rationale, and code you can't explain is code you can't maintain, debug, or safely change later. "It passes the tests" is not a reason.

Notice what's not on that list: style, naming, formatting, obvious bugs, missing validation, leaked secrets, vulnerable dependencies. Those all got caught upstream by a machine. That's the entire design goal — push every mechanizable check onto the gates before Gate 4, so the human's scarce, non-compressible attention lands only on architecture, trust, and intent.

What I stopped doing

The honest part of a workflow post is what I removed, not what I added. Three habits from the pre-agent SDLC that are now actively harmful:

I stopped trusting a green test suite as evidence of correctness on agent code. The agent writes the tests too, and it writes them to pass. Tests authored alongside the implementation by the same model verify that the code does what the code does — a tautology, not a check. I now read the tests as part of the diff, with the same suspicion, and I write the load-bearing test myself.
I stopped reviewing large agent diffs in one pass. A 600-line agent diff reviewed linearly is where automation bias wins — by line 200 you're skimming. I make the agent produce small, single-purpose changes, and I reject "while I was in there" scope creep hard. Small diffs aren't just easier to review; they're the only diffs you can review honestly at this volume.
I stopped letting "the agent already checked it" end a conversation. The agent's self-review is a useful first pass and a worthless final one. It shares every blind spot with the code it wrote. Self-review by the author — human or machine — has never been a control, and dressing it in confident prose doesn't change that.
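The "small, single-purpose changes" rule is the easiest of these to enforce mechanically. A minimal sketch, assuming the input is the text produced by `git diff --numstat` (tab-separated added, deleted, path, with `-` for binary files). The budget numbers are placeholders, not figures from this post.

```python
def changed_lines(numstat):
    """Return (file_count, total_added_plus_deleted) from `git diff --numstat` text."""
    files, total = 0, 0
    for row in numstat.splitlines():
        if not row.strip():
            continue
        added, deleted, _path = row.split("\t", 2)
        files += 1
        if added == "-":          # binary files report "-" for both counts
            continue
        total += int(added) + int(deleted)
    return files, total

def within_review_budget(numstat, max_lines=200, max_files=10):
    """True when the diff is small enough to review honestly (assumed thresholds)."""
    files, total = changed_lines(numstat)
    return total <= max_lines and files <= max_files

# Hypothetical numstat output for an oversized agent change
sample_numstat = "12\t3\tsrc/auth.py\n410\t190\tsrc/billing.py\n-\t-\tassets/logo.png\n"
print(changed_lines(sample_numstat))
print(within_review_budget(sample_numstat))
```

Wired into CI, a failing budget sends the change back to the agent to be split, before anyone starts skimming.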

The bigger claim

The SDLC didn't get shorter; its center of gravity moved from authoring to verification, and most teams' process still points at the old center. We spent two years making the "write" step nearly free and left the "review" step exactly as manual, exactly as trust-based, and exactly as human-throughput-limited as it was when a person wrote every line. That imbalance is not sustainable, and the incident reports are already the receipts.

The teams that win the next phase won't be the ones with the best agent. Agents are converging; the model is becoming a commodity input. The differentiator is the verification pipeline around it — how much of review you've moved onto deterministic machines, how tightly you've scoped the agent's blast radius, and how ruthlessly you've protected the small human judgment that doesn't compress. Build that, and the 3–5x is real. Skip it, and you've just automated the production of plausible, confident, subtly-wrong code, and handed it to a reviewer who's slower than they think.

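Start with the cheapest gate. Run gitleaks detect and osv-scanner on the last thing your agent shipped (gitleaks protect --staged only sees changes that are staged and not yet committed, so it cannot inspect work that has already landed). If either one lights up, you already know which half of your SDLC you rebuilt and which half you didn't.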

What matters

1. When AI writes the code, the SDLC bottleneck moves from writing to reviewing, and review is where humans are slowest and most overconfident (METR measured experienced devs 19% slower with AI while feeling faster).
2. AI diffs fail differently than human diffs: they are plausible, not clumsy. Automation bias and fabricated dependencies (slopsquatting) are the failure modes your process must be built around.
3. Push every mechanizable check onto deterministic gates (provenance/dependency audit, secret + SAST scanning, a second-model review pass) before a human spends any attention.
4. Sandbox the agent first: never give it long-lived production credentials or ambient cloud access. Least agency is the precondition for calm review.
5. The irreducible human review is three questions: does it match a chosen architecture, did the trust boundaries hold, and can I reconstruct the why? Everything else is a machine's job.
6. Stop trusting agent-written tests as proof, stop reviewing giant diffs in one pass, and stop letting agent self-review end the conversation.