The green build lied: the bugs AI hides behind passing tests

Every test passed and the green build looked done. More than twenty real bugs were still hiding behind it, and here is the layered review that caught them.

Every check was green. Types, tests, coverage, build, all passing. By the usual definition of "done," I was done.

So I spent the next stretch hunting for everything that green could not see. More than twenty real bugs turned up. Any one of them could have shipped. A couple would have quietly broken the product for every user, with the test suite cheerfully reporting success the whole time.

This is the story of the checks I have learned to run on top of the green ones, and the kind of bug each one catches.

Why a green build is a weak promise

A passing build has always told you less than it seems to. It says the code is mechanically fine: it compiles, the types line up, the tests pass. It does not say the code does the right thing. Those are different claims, and the gap between them is where products quietly break.

AI widens that gap. It is brilliant at producing code that runs and tests that pass, which is not the same as code that is correct and tests that check anything. Ask an AI for tests and it will happily hand you one that calls your function and then inspects almost nothing about the result. Your coverage report lights up green over the top of it. The old worry was "nobody wrote a test for this." The new one is "there is a test, it is green, and it is hollow."

The faster AI writes, the more I want something pushing back. So I have built up a set of checks that deliberately disagree with each other, each one catching what the others miss.

The product they run on

Those checks are only as good as the thing they run on, and you cannot stress-test a safety net on a toy. So the system underneath is real: an AI-native hyper-personalized outreach CRM I am building. Its pipeline finds candidates, scores each against an ideal-customer profile, and drafts a first message for the ones that clear the bar, with the option to dig up a richer dossier and rewrite from it first. A person reviews and acts; it never sends on its own.

In one session I wired that spine end to end. It is a walking skeleton: the spine runs on real data moving through real handoffs, which is what surfaces the bugs that matter, but the edges are still stubbed. The sources that feed it and the enrichment step are placeholders, not real integrations yet.

Layer one: the harness, a row of instruments

The checks come in two layers. The first, the harness, is fully automated: one command, npm run verify, run on every change. (The second layer, review with judgment, comes after.) The trick is to stop thinking of it as "the tests" and picture a row of instruments, each measuring a different property. A green gate means all the needles read in range at once. It does not mean the plane is flying right.

Each instrument answers a question the others cannot:

Sensor	Tool	The question it answers	What it catches
typecheck	`tsc --noEmit`	Do the types hold?	bad signatures, type errors
lint	ESLint	Idiomatic and safe patterns?	unsafe patterns (a raw `<a>` where a `<Link>` belongs)
format	Prettier	Consistent formatting?	style drift
depcruise	dependency-cruiser¹	Are the boundaries intact?	layering violations, port-skipping, cycles
dup	jscpd²	Is logic duplicated?	copy-paste, missing abstraction
coverage	Vitest³ + v8	Was this line executed by a test?	untested code
build	`next build`	Does it compile, and does the server-only boundary hold?	secrets leaking into the client bundle
canon	Vitest³	Do the docs still match the code?	doc rot, broken decision records
mutation	Stryker⁴	Would a test fail if behavior changed?	tests that assert nothing

The first seven run together as npm run verify on every change; the canon checks run in CI on every push, and mutation testing runs at milestones. Four of these are worth dwelling on, because they are where the harness stops being standard hygiene and becomes a correctness instrument.
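To make "all the needles read in range at once" concrete, here is a minimal sketch of a gate runner. It is an illustration, not the project's actual script: only `tsc --noEmit` and `next build` are quoted from the table, the other command lines and the `src` path are assumptions, and the command runner is injected so the example runs without spawning anything.

```typescript
// Hypothetical sketch of what a verify gate does. The article only names
// `npm run verify`; everything below is illustrative.
type Instrument = { name: string; command: string };
type RunCommand = (command: string) => number; // returns the exit code

// The seven per-change instruments from the table. Only `tsc --noEmit` and
// `next build` come from the article; the rest are assumed invocations.
const instruments: Instrument[] = [
  { name: "typecheck", command: "tsc --noEmit" },
  { name: "lint", command: "eslint ." },
  { name: "format", command: "prettier --check ." },
  { name: "depcruise", command: "depcruise src" },
  { name: "dup", command: "jscpd src" },
  { name: "coverage", command: "vitest run --coverage" },
  { name: "build", command: "next build" },
];

// Run every instrument, even after a failure, so a single pass reports every
// needle that is out of range. The gate is green only if none failed.
function verify(
  list: Instrument[],
  run: RunCommand,
): { green: boolean; failed: string[] } {
  const failed = list
    .filter((instrument) => run(instrument.command) !== 0)
    .map((instrument) => instrument.name);
  return { green: failed.length === 0, failed };
}

// Demo with a fake runner in which only the duplication check fails.
const fakeRun: RunCommand = (command) => (command.startsWith("jscpd") ? 1 : 0);
const gate = verify(instruments, fakeRun);
console.log(gate.green, gate.failed); // green is false, failed holds "dup"
```

In a real script the injected runner would wrap a process spawn and return its exit status. Running all seven rather than stopping at the first failure is a design choice of this sketch: it trades a little time for a complete reading of the panel.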

Boundaries as build-failing rules, not code review

Architecture only survives if a machine enforces it. Written in a wiki, "the lower layer must never import the UI" lasts until the first deadline. So the boundaries are encoded as architectural fitness functions in dependency-cruiser, and a violation turns the gate red:

no cycles between modules (a cycle is a load-order hazard waiting to happen),
the data and logic layer may never import the UI layer,
the Node-only bootstrap is reachable only through one dynamic import, so nothing Node-specific leaks into the part of the app that has to compile for the edge,
and the ports for the LLM, enrichment, and signal sources must not import their concrete adapters.

That last rule started out with a hole in it. A later whole-system review caught the gap, and the fix became a real build rule.
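As a rough sketch, the four boundaries could be written in dependency-cruiser's `forbidden` rule format along these lines. Every path here (`src/core`, `src/ui`, `src/bootstrap/node`, `src/ports`, `src/adapters`) is a hypothetical layout, since the article does not show the real tree, and the rule names are made up for illustration.

```typescript
// Sketch of the four boundary rules in the shape dependency-cruiser expects
// under `forbidden`. All paths and rule names are assumptions.
type ForbiddenRule = {
  name: string;
  severity: "error" | "warn" | "info";
  comment: string;
  from: { path?: string; pathNot?: string };
  to: { path?: string; circular?: boolean; dynamic?: boolean };
};

const forbidden: ForbiddenRule[] = [
  {
    name: "no-circular",
    severity: "error",
    comment: "A cycle is a load-order hazard waiting to happen.",
    from: {},
    to: { circular: true },
  },
  {
    name: "core-not-to-ui",
    severity: "error",
    comment: "The data and logic layer may never import the UI layer.",
    from: { path: "^src/core" },
    to: { path: "^src/ui" },
  },
  {
    name: "node-bootstrap-dynamic-only",
    severity: "error",
    // Narrower than the article's rule: this forbids static imports of the
    // bootstrap but does not limit how many files import it dynamically.
    comment: "The Node-only bootstrap may only be reached via import().",
    from: {},
    to: { path: "^src/bootstrap/node", dynamic: false },
  },
  {
    name: "ports-not-to-adapters",
    severity: "error",
    comment: "Ports must not import their concrete adapters.",
    from: { path: "^src/ports" },
    to: { path: "^src/adapters" },
  },
];

// In a real .dependency-cruiser.js file this object would be the export.
const depcruiseConfig = { forbidden };
console.log(depcruiseConfig.forbidden.map((rule) => rule.name));
```

Because every rule carries `severity: "error"`, a violation fails the run instead of printing a warning, which is what turns a wiki sentence into a gate.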

A zero-percent duplication threshold, scoped on purpose

jscpd is set to fail on any duplication: as little as fifty tokens repeated across five lines turns the gate red. That is deliberately aggressive. It forces a shared abstraction the moment a second copy appears, instead of waiting for the mess to spread.

It also fights you if you let it. A zero threshold creates a real tension with the old "rule of three" instinct (tolerate the second copy, extract on the third). I resolved it toward extraction for genuine shared logic (a shared timestamp-columns helper, a single loader for an actionable prospect, one base query) while telling the scanner to ignore test files and the throwaway prototype. The lesson is not "zero duplication is virtuous." It is that a forcing function without a scope spends your day forcing the wrong things. Decide what it may look at, then let it be ruthless inside that boundary.
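A minimal sketch of that scoping, written as a TypeScript object that mirrors the keys of a `.jscpd.json` file. The threshold and clone sizes match the numbers above; the ignore globs are assumptions, because the article does not say where its tests and prototype live.

```typescript
// Hypothetical jscpd settings. The keys mirror jscpd's config file; the
// ignore globs are guesses at the project's layout.
const jscpdConfig = {
  threshold: 0, // fail when the duplication percentage exceeds zero
  minTokens: 50, // smallest clone that counts, in tokens
  minLines: 5, // smallest clone that counts, in lines
  // The scope decision: test files and the throwaway prototype are exempt.
  ignore: ["**/*.test.ts", "prototype/**"],
  reporters: ["console"],
};

// Serialized, this is what would be saved as .jscpd.json.
console.log(JSON.stringify(jscpdConfig, null, 2));
```

The two halves of the lesson sit side by side here: `threshold` is the forcing function, and `ignore` is the boundary it is allowed to be ruthless inside.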

A coverage floor that cannot be gamed by averaging

Coverage is per-file, not global, and the floor sits just under the measured baseline, ratcheting up over time. Per-file matters: a global average lets a brand-new untested module hide behind a well-tested old one. Per-file, the new module has to carry its own weight.
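The averaging problem is easy to see with numbers. The sketch below uses two made-up files and an assumed floor of 80%; none of these figures come from the project. (Vitest exposes this choice as the `coverage.thresholds.perFile` option; the article does not show its own config.)

```typescript
// Worked example: a global average hides an untested module, a per-file
// floor does not. File names, line counts, and the floor are hypothetical.
type FileCoverage = { file: string; covered: number; total: number };

const percent = (covered: number, total: number): number =>
  total === 0 ? 100 : (covered / total) * 100;

// One number for the whole codebase: total covered lines over total lines.
function globalCoverage(files: FileCoverage[]): number {
  const covered = files.reduce((sum, f) => sum + f.covered, 0);
  const total = files.reduce((sum, f) => sum + f.total, 0);
  return percent(covered, total);
}

// Per-file check: every file must clear the floor on its own.
function filesBelowFloor(files: FileCoverage[], floor: number): string[] {
  return files
    .filter((f) => percent(f.covered, f.total) < floor)
    .map((f) => f.file);
}

const report: FileCoverage[] = [
  { file: "scoring.ts", covered: 950, total: 1000 }, // mature, well tested
  { file: "dossier.ts", covered: 10, total: 100 }, // new, barely tested
];
const floor = 80;

const overall = globalCoverage(report); // 960 / 1100, about 87.3%
const offenders = filesBelowFloor(report, floor); // ["dossier.ts"]
console.log(overall.toFixed(1), overall >= floor, offenders);
```

Judged globally, the suite clears the floor at roughly 87%. Judged per file, `dossier.ts` sits at 10% and fails on its own.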

The honest part is written into the config itself. Coverage answers one question only: was this line executed by a test? It does not answer would a test fail if this line were wrong? Those are different questions, and the distance between them is where hollow tests live. Several files are excluded from the floor on purpose, each with a written reason beside it: thin queue wrappers that only delegate, network adapters that need a live key, the declarative database schema, the logger, the app shell. An exclusion with a reason is a decision. An exclusion without one is a leak.

Mutation testing, the instrument that reads assertions

This is the one most teams do not have, and the one that matters most when an AI writes your tests.

Mutation testing turns the question on the tests themselves. It changes your source in tiny ways (flips a comparison, blanks a string, drops an argument) and reruns the suite. If the tests still pass, that mutation "survived" - you have a test that runs the code but asserts nothing about it. A surviving mutant is a hollow test caught red-handed. It earned its place at once: a test confirmed that an error was raised but never looked at what the error said, so the message could be blanked out entirely and the test stayed green. A whole category of failures could have gone dark while the suite swore everything was fine.

It is expensive, since it reruns the suite once per mutation, so it lives out-of-band, at milestones, never in the per-change loop, and it is pointed only at pure, no-I/O logic - the same surface the coverage floor guards.
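The blanked-message case can be reproduced in miniature. This is a hand-rolled sketch of the idea, not Stryker's output: `validateScore`, the mutant, and both "tests" are hypothetical, and the tests are plain functions so the example runs without a test framework.

```typescript
// Original: rejects an out-of-range score with a descriptive message.
function validateScore(score: number): number {
  if (score < 0 || score > 100) {
    throw new Error("score must be between 0 and 100");
  }
  return score;
}

// Mutant: the kind of change a string-literal mutator makes. The error is
// still thrown, but the message has been blanked.
function validateScoreMutant(score: number): number {
  if (score < 0 || score > 100) {
    throw new Error("");
  }
  return score;
}

type Validator = (score: number) => number;

// Hollow test: passes as long as anything is thrown.
function hollowTest(validate: Validator): boolean {
  try {
    validate(101);
    return false;
  } catch {
    return true;
  }
}

// Strict test: also checks what the error says.
function strictTest(validate: Validator): boolean {
  try {
    validate(101);
    return false;
  } catch (error) {
    return error instanceof Error && error.message.includes("between 0 and 100");
  }
}

// A mutant is "killed" when a test that passed on the original now fails.
const hollowKillsMutant = hollowTest(validateScore) && !hollowTest(validateScoreMutant);
const strictKillsMutant = strictTest(validateScore) && !strictTest(validateScoreMutant);
console.log({ hollowKillsMutant, strictKillsMutant });
```

Both tests execute the same lines, so coverage cannot tell them apart. Only the mutant can: it survives the hollow test and is killed by the strict one.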

Layer two: review, at three distances

Machines catch a lot, but each one reads a piece of code in isolation. None can tell that a change quietly contradicts a decision made three features ago, or that a bug is hiding in the seam between two parts that are each fine on their own. That takes a reading with judgment.

That reading is done by specialized AI review agents. Each is a sub-agent with a narrow brief and one hard limit: it can read and report, but never change anything. I run them through commands, at set points in the work, and I make the final call on what they surface.

That last limit is not negotiable, and I learned why the hard way. Early on, a review agent decided to be helpful, ran a command it should not have, and scrambled the records of which database changes had been applied. The very next run noticed and put it right, so nothing was lost, but the lesson stuck: a reviewer that can change what it is judging is not a reviewer. It is a second author you are not watching.

Only the harness runs automatically: a red build blocks the merge, no human needed. The reviews are not forced by any tool. They are just my own rule for what counts as done - I do not call a change finished, or a milestone closed, until its review comes back clean.

The harness has already covered the closest zoom, a single function, where coverage and mutation testing do the whole job. Review picks up one level out and widens from there, across three distances:

1. One change, read in context. I run this after every change, and again after every fix, because a fix is fresh code written into the exact spot just flagged, and nobody has reviewed the fix itself yet. The pass reads the change together with what surrounds it: its callers and the assumptions it relies on, never the patch alone.

2. The whole system. I run this at two moments: the first time separately-built features are merged into one tree (a feature can pass on its own branch and still break once merged), and again, more lightly, at each milestone. The command is /system-review. It dispatches three read-only lens agents in parallel, cut by where a failure can live rather than by topic, plus a fourth agent, the chair, that merges their findings and ranks them.

Each lens is a real sub-agent, defined by its own file in the repo. Here is the header of each definition; the full prompt body is omitted.

static-composition is the structural lens. It checks how the pieces are wired - the composition root, the direction of dependencies, and server-only boundaries - and its main output is a list of new build rules to add.

---
name: static-composition
description: System-review lens A - audits STRUCTURAL
  properties of the wired code tree (composition-root wiring,
  port direction, server-only/secret residency,
  fitness-function gaps). Its primary output is a list of
  build-enforceable fitness functions to write. Read-only;
  critiques code, never edits.
tools: Read, Grep, Glob
model: sonnet
---
[... full prompt body follows ...]

lifecycle-reachability is the judgment core, the one lens I run on the strongest model. It models each entity's state machine and owns the strand bug: a record left in a state it can never leave, given the adapters actually shipped rather than the behavior their interfaces promise.

---
name: lifecycle-reachability
description: System-review lens B - the irreducible judgment
  core. Models each entity's state machine and proves every
  state has a reachable outbound edge UNDER THE REAL adapter
  dispositions in the tree (not the nominal contract). Owns
  the strand-bug class. Crash/retry/partial-failure is a
  depth modifier. Read-only; critiques code, never edits.
tools: Read, Grep, Glob
model: opus
---
[... full prompt body follows ...]
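To make the strand bug concrete, here is a deliberately simplified sketch of the property this lens looks for. It is not how the agent works (the agent reads code and reasons about it), and the states, transitions, and the adapter that never completes are all hypothetical.

```typescript
// Simplified strand check: a non-terminal state with no outbound transition
// traps any record that enters it. All names here are hypothetical.
type Transition = { from: string; to: string };

function findStrandedStates(
  states: string[],
  terminal: Set<string>,
  transitions: Transition[],
): string[] {
  const hasExit = new Set(transitions.map((t) => t.from));
  return states.filter((s) => !terminal.has(s) && !hasExit.has(s));
}

const states = ["found", "scored", "enriching", "enriched", "drafted", "rejected"];
const terminal = new Set(["drafted", "rejected"]);

// What the interfaces promise: enrichment always finishes.
const nominal: Transition[] = [
  { from: "found", to: "scored" },
  { from: "scored", to: "enriching" },
  { from: "scored", to: "drafted" },
  { from: "scored", to: "rejected" },
  { from: "enriching", to: "enriched" },
  { from: "enriched", to: "drafted" },
];

// What is shipped in this imagined tree: the enrichment adapter never
// reports completion, so nothing ever leaves "enriching".
const shipped = nominal.filter((t) => t.from !== "enriching");

const strandedNominal = findStrandedStates(states, terminal, nominal); // []
const strandedShipped = findStrandedStates(states, terminal, shipped); // ["enriching"]
console.log(strandedNominal, strandedShipped);
```

Against the nominal contract nothing is stranded. Against the shipped transitions, `enriching` has no way out. A fuller check would also need to catch cycles with no path to a terminal state, which this sketch ignores.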

invariant-canon is the consistency lens. It checks that one business rule lives in one place, that derived facts stay derived, and that the code still matches the locked decisions, the ADRs.

---
name: invariant-canon
description: System-review lens C - audits SEMANTIC
  consistency across the code tree (knowledge-DRY,
  derived-vs-stored consistency, and code-vs-canon drift -
  the implemented Current Architecture vs the Planned
  Architecture in the ADRs and domain model). Read-only;
  critiques code, never edits.
tools: Read, Grep, Glob
model: sonnet
---
[... full prompt body follows ...]

Two rules keep the panel honest, and both are written into the agents' own definitions. The first is a filing bar that all three lens agents carry word for word:

# from static-composition, lifecycle-reachability, and invariant-canon
A finding is admissible only with an exhibited path: a concrete
file:line -> file:line trace ... No path, no finding.

A finding counts only if the reviewer can point to the exact path to the bad state, and the chair that merges the findings throws out any that arrive without one. There is no separate "is this real?" vote.

The second is static-composition's standing order:

# from static-composition
most of what you find should become a fitness function. For every
structural finding, name the dependency-cruiser rule or test that
would retire that class of bug into the build forever.

So when a finding is something a machine could check, the fix is not just to patch the code; it is to add a new fitness function to the harness from Layer one: a plain code check, a dependency-cruiser rule or a test rather than anything AI-driven, that the gate runs on every change. From then on that whole class of bug is caught by the build itself, so the next review has fewer things left for a person to find.

3. The architecture. This runs only when a major design decision changes, and it never looks at the code at all. It reads the design decisions themselves, checking that each still rests on a real reason and that a new decision does not quietly contradict one already locked. That tier is a whole workflow of its own, run through its own commands, and I wrote it up separately in "How I do architecture with AI without letting it make the decisions."

What I take from this

Step back from the individual bugs, and the same few lessons keep surfacing.

Every check answers a different question, so I stack them. Coverage says "this ran." Mutation testing says "a test would notice if it broke." The boundary rules say "the architecture is intact." The review passes say "this is right in context" and "this holds together across the whole system." None of them can do another's job.
A passing build has a hard ceiling. It cannot see logic in context, bugs in the seams, or whether a test checks anything at all. Every real bug here was on the far side of green. If a green build is all you trust, you ship all of them.
If AI writes your tests, test the tests. Mutation testing is the one check built for the exact failure AI falls into: tests that run everything and verify nothing.
A good review process tries to delete itself. Anything a machine can check becomes an automated rule, so a person never hunts for it twice. Reserve human attention for the judgment no rule can capture.

The honest scoreboard

One session, one walking skeleton: 125 tests, no duplicated logic, a clean build, and a 94% mutation score (the tests caught 94% of the deliberate bugs thrown at them). A clean board by every standard measure.

Behind it, more than twenty real bugs that the green build was perfectly happy with. The most expensive would have quietly broken the product for every user, and not one automated check could have surfaced it.

The scarce thing in 2026 is not generating code and tests that pass. Everyone has that now, and AI makes it nearly free. What is scarce is the judgment to build checks that argue with each other, and the nerve to keep reading after the light turns green. A green build is not the answer. It is the first question.

¹ dependency-cruiser, architectural fitness functions for JavaScript and TypeScript. https://github.com/sverweij/dependency-cruiser
² jscpd, copy-paste and duplication detector. https://github.com/kucherenko/jscpd
³ Vitest, the test runner and coverage tool used here. https://vitest.dev
⁴ Stryker Mutator, mutation testing for JavaScript and TypeScript. https://stryker-mutator.io