How I build software with AI agents: the complete setup
How I build software with AI agents without shipping their mistakes: the specs, gates, and review loops that catch the AI when it is confidently wrong.
I am building a product where AI agents write most of the code. The hard part is not the code. It is everything around the agents that keeps them fast and stops them from shipping something wrong. I call that supporting structure the harness. Its most important rule is a strict one: the AI never decides that its own work is good enough. Code decides that. This is the full tour, from the first idea to shipped code.
I have written about parts of this setup before: why I write the spec before the code, the bugs a green build hides, how I design architecture with AI without handing it the decisions, and what happens when AI games its own checks. This article puts the whole thing in one place.
An AI agent works like a very fast junior developer. It is capable, it never gets tired, and every so often it is wrong while sounding completely sure of itself. You cannot fix that by writing a longer rulebook for every task. Three moves do most of the work.
- Match the effort to the risk. A typo fix and a database redesign carry very different risk, so they should not carry the same process. Cheap, reversible changes take a light path. The rare change that is expensive to get wrong takes the full one. Do it any other way and you either slow down the easy work or under-check the dangerous work.
- Check each thing the way it actually fails. A document and a program fail in different ways, so I check them in different ways. A document can state something false and sound completely sure, so I check its claims against real sources. Code fails by doing the wrong thing while still looking fine, so I make it prove it works: it has to run, pass tests that actually check behavior, and match the spec it was built from. A green build is where that proof starts, not where it ends.
- Automate every check a machine can do, and spend people only on judgment. No one should have to hunt twice for a bug a rule could catch. So every problem that keeps coming back becomes an automatic check, and the human time that is left goes to the things no rule can decide. That list of human-only checks keeps getting shorter.
The rest of this article shows these three moves in practice.
The map
Here is the whole flow, from an idea to shipped code.
Each box is labeled with who does the work: You (a person), AI (an agent), or Auto (an automatic check, with no AI and no person involved).
Both lanes can produce an ADR (Architecture Decision Record): a short, dated, permanent note of one decision and the reasoning behind it. Once accepted it is not edited, only superseded by a newer one. The next sections explain when it appears.
Three actors do the work, and each one does what it is best at.
- You make the judgment calls, and only those. Whether a change shapes the architecture. Whether a design decision is sound. The final decision on every review.
- AI agents do the production work. They draft the spec, write the code, and run the reviews.
- Auto is the safety net that never thinks. It is the automatic gate on every change and the automatic re-check on every push. It has no opinion. It either passes or it fails.
The two lanes are different only at the start, in how much design work happens before the code is written. After that they join into one shared path: the same gate, the same review, the same push, the same final check. The rest of this article walks that shared path first. Then it comes back to the extra work the architecture lane does at the start.
Spec before code
Before writing any code, the agent writes a short set of documents. I use a tool called OpenSpec for this. It is a spec-first workflow: the agent writes a proposal, spec, and task list before touching code, and the code is reviewed against that spec afterward. We call the workflow OPSX. The spec is the source of truth. The code has to match the spec, not the other way around.
An ADR is written only when the change records a durable decision; most changes write none. The agent then works through the task list, and later we review the code against the spec.
Why not let the agent write code straight from a prompt? Because a prompt disappears the moment it runs. If the only record of what the change should do is the code itself, then there is nothing to check the code against except more code. The spec is a separate description of what the change should do, written before the code exists. It gives the review something independent to check against, and it gives the agent one fixed target to hit.
The dial: two lanes
This is the most important time-saver. Almost every change takes the same default lane. Only the rare change that shapes the system itself takes the heavier architecture lane.
First, the part that is the same in both lanes. Every change runs through the same workflow: the AI writes a spec, you check it, the AI writes the code, then the change is verified and archived. Even a one-line bug fix goes through this. We do not fix things straight from a prompt. The only exception is a change to our own tooling, config, or process, which we edit by hand.
The lane is decided by one question: does this change shape the architecture?
| Lane | Use when (and what the AI writes) | What you approve by hand |
|---|---|---|
| Default | Anything that does not reshape the system: a bug fix, a rename, a tweak, or a durable decision that fits one record plus code. The AI writes a spec and a task list, plus one ADR, but only when the change actually records a durable decision. Most changes record none. | The code change, after a quick review. Plus the ADR, if one was written, because it becomes permanent. |
| Architecture | The change reshapes stored data, changes the architecture, or sets a boundary future changes must live with. The AI writes the full set: an explore note, diagrams, a domain model, and an ADR. | The code change, plus the whole design, after a panel of agents has reviewed it, before any code is written. |
Notice the ADR in the default lane is conditional. The AI writes one only when the change records a decision that future work must live with: a retry policy, a data format, a rule. For a rename or a typo fix it writes none. The flow is fixed, but the paperwork is only as heavy as the decision inside the change. An ADR is worth looking back on later precisely because we do not write one for every trivial change.
Most changes are the default lane with no ADR, and that is the point. The ceremony only shows up when the decision is real.
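The lane question and the conditional ADR can be sketched as a small function. This is an illustration only: the trait names, the Plan shape, and planChange are hypothetical, chosen to mirror the table above. In practice the lane call is a human judgment, not something computed.

```typescript
// Hypothetical sketch of the two-lane dial. All names are illustrative.
type Lane = "default" | "architecture";

interface ChangeTraits {
  reshapesStoredData: boolean;
  changesArchitecture: boolean;
  setsLastingBoundary: boolean;
  recordsDurableDecision: boolean;
}

interface Plan {
  lane: Lane;
  artifacts: string[];     // what the AI writes
  humanApproves: string[]; // what a person signs off by hand
}

function planChange(t: ChangeTraits): Plan {
  // The single question: does this change shape the architecture?
  const shapesArchitecture =
    t.reshapesStoredData || t.changesArchitecture || t.setsLastingBoundary;

  if (shapesArchitecture) {
    return {
      lane: "architecture",
      artifacts: ["explore note", "diagrams", "domain model", "ADR"],
      humanApproves: ["design", "code change"],
    };
  }

  // Default lane: the ADR appears only when a durable decision is recorded.
  const artifacts = ["spec", "task list"];
  const humanApproves = ["code change"];
  if (t.recordsDurableDecision) {
    artifacts.push("ADR");
    humanApproves.push("ADR");
  }
  return { lane: "default", artifacts, humanApproves };
}
```

A typo fix would pass all four traits as false and get the lightest plan; a retry policy would set only recordsDurableDecision and pick up one ADR on the same lane.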
When it reshapes the system: the architecture lane
The architecture lane is for the biggest decisions, the ones that shape the system itself. Here we design the architecture, not just write a feature. It is the slowest and most expensive lane, so we only use it when the decision is big enough to deserve it.
Four ideas make this work.
- Keep research and decisions apart. The messy research goes in an explore note. The final decision goes in a separate ADR that links back to the note, so anyone can see why we chose what we chose.
- Label every claim as you write it. Is this an outside fact? Does it follow from an earlier decision? Or is it a new choice we are making here? The labels are hidden HTML comments, so they do not show up when the document is read: <!-- v:fact -->, <!-- v:derives -->, <!-- v:decision -->. A claim with no label counts as unsupported and does not ship. If a claim cannot earn a label, we do not write it.
- Check the facts before debating the design. First, a ledger agent checks every labeled claim. A fact gets its source checked. A "follows from an earlier decision" claim gets re-derived from that decision. Only after the facts hold does a panel of expert agents attack the design, and then a chair writes one verdict. A person has to sign off before anything becomes permanent. The order matters. A model that checks its own work catches the claims it was unsure about, but it misses the ones it is confident and wrong about. One real example: a wrong claim about how our job queue waits for new work sounded completely right and would have passed a self-check. Only checking the real source caught it.
- Generate the clean version, never hand-write it. When the design is approved, we strip out the labels and publish the clean diagrams into our main architecture docs. The labeled document stays the one we edit. The clean version is generated from it. The two can never drift apart, because one is built from the other.
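The labeling rule and the generated clean version can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the project's actual tooling: it assumes one claim per non-empty line with the label on that same line, and the function names unlabeledClaims and stripLabels are hypothetical.

```typescript
// Assumption: one claim per non-empty, non-heading line, labeled on that line.
const LABEL_SOURCE = "<!--\\s*v:(?:fact|derives|decision)\\s*-->";

// Lines that carry no label: under the rule above, these would not ship.
function unlabeledClaims(doc: string): string[] {
  const hasLabel = new RegExp(LABEL_SOURCE);
  return doc
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0 && !line.startsWith("#"))
    .filter((line) => !hasLabel.test(line));
}

// The published version is derived from the labeled one, never hand-written.
function stripLabels(doc: string): string {
  return doc.replace(new RegExp("[ \\t]*" + LABEL_SOURCE, "g"), "");
}

// Made-up sample text for illustration.
const labeled = [
  "Claim A comes from an outside source. <!-- v:fact -->",
  "Claim B is a new choice. <!-- v:decision -->",
  "Claim C has no label.",
].join("\n");

console.log(unlabeledClaims(labeled));
console.log(stripLabels(labeled));
```

Because the clean text is always produced by stripLabels from the labeled source, there is only one document to edit.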
Decisions do not get rewritten. We never edit an old ADR. If a decision changes, we write a new ADR that replaces the old one and links back to it. The history stays honest.
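One way to picture "superseded, not edited" is an append-only log where only the new record carries the link. This is a hypothetical sketch: the Adr shape and the functions supersede and currentAdrs are illustrative, not the format of the real decision records.

```typescript
// Hypothetical ADR log: records are read-only, and replacing one appends a new one.
interface Adr {
  readonly id: number;
  readonly title: string;
  readonly body: string;
  readonly supersedes?: number; // link back to the ADR this one replaces
}

function supersede(log: readonly Adr[], oldId: number, title: string, body: string): Adr[] {
  if (!log.some((adr) => adr.id === oldId)) {
    throw new Error(`ADR ${oldId} not found`);
  }
  const nextId = Math.max(...log.map((adr) => adr.id)) + 1;
  // The old record is copied over untouched; only the new one carries the link.
  return [...log, { id: nextId, title, body, supersedes: oldId }];
}

// An ADR is current if no later ADR points back at it.
function currentAdrs(log: readonly Adr[]): Adr[] {
  const replaced = new Set<number>();
  for (const adr of log) {
    if (adr.supersedes !== undefined) replaced.add(adr.supersedes);
  }
  return log.filter((adr) => !replaced.has(adr.id));
}
```

Which ADRs are still in force is computed from the links, so nothing in an accepted record ever has to change.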
The diagrams use C4, a way to draw a system at different zoom levels: the whole system, then the running pieces inside it (its containers), then the smaller parts inside each piece (the components). It keeps each diagram at one level of abstraction instead of mixing them. Next to the diagrams is a domain model, which names the main things in the system and the states they can be in. Both are what the design panel reviews, and both get published once the checks pass.
The automatic gate
Before any person reviews the work, one command has to pass: npm run verify. It checks the whole codebase, not only the lines that changed. So it also catches damage a change does to the rest of the system.
This is the cheap layer that never gets tired. If any check fails, the work stops right there. The clever check is the boundaries one. Our architecture rules are written as checks that fail the build [1], so the agent simply cannot connect two parts of the system in a way the rules forbid.
A rule written in a wiki gets ignored at the first deadline. A rule that fails the build is followed every time.
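The idea of a rule that fails the build can be shown without any tooling. The sketch below is not dependency-cruiser's configuration format or API. It is a stand-alone illustration of the principle, and the rule name, file paths, and the function findViolations are all hypothetical.

```typescript
// Illustrative only: a boundary rule as data, checked against import edges.
interface BoundaryRule {
  name: string;
  from: RegExp; // importing file
  to: RegExp;   // imported file
}

interface ImportEdge {
  from: string;
  to: string;
}

function findViolations(edges: ImportEdge[], rules: BoundaryRule[]): string[] {
  const violations: string[] = [];
  for (const edge of edges) {
    for (const rule of rules) {
      if (rule.from.test(edge.from) && rule.to.test(edge.to)) {
        violations.push(`${rule.name}: ${edge.from} -> ${edge.to}`);
      }
    }
  }
  return violations;
}

// Hypothetical rule: UI code may not reach into a capability's adapters.
const rules: BoundaryRule[] = [
  { name: "no-ui-to-adapters", from: /^src\/app\//, to: /^src\/lib\/[^/]+\/adapters\// },
];

// Hypothetical import graph for illustration.
const edges: ImportEdge[] = [
  { from: "src/app/page.tsx", to: "src/lib/enrichment/port.ts" },
  { from: "src/app/page.tsx", to: "src/lib/enrichment/adapters/http.ts" },
];

const found = findViolations(edges, rules);
// A build script would exit non-zero here, which is what makes the rule binding.
console.log(found.length === 0 ? "boundaries ok" : found.join("\n"));
```

The point of the design is that the rule lives in the build, so breaking it is a failed check rather than a review comment.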
Two more checks run outside this per-change loop. A duplication scanner [2] fails if the same code is copied in two places. That forces us to put shared logic in one place, instead of two copies that can slowly drift apart. And mutation testing [3] runs at milestones to find empty tests. It makes a small change to the source on purpose, like flipping a comparison or blanking out a piece of text, then runs the tests again. If the tests still pass, the change went undetected: that test ran the code but checked nothing about its behavior. Most teams do not have this check. It matters most when an AI writes the tests, because a test that runs the code and checks nothing is exactly what an AI gives you when you ask it for test coverage.
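A toy version of what mutation testing exposes, written by hand rather than with Stryker. The comparison function, the two tests, and mutantSurvives are hypothetical; a real tool generates the mutants and runs the real test suite.

```typescript
// Hand-rolled illustration of one mutant and two tests. Not Stryker's API.
type Check = (score: number, bar: number) => boolean;

const original: Check = (score, bar) => score >= bar;
const mutant: Check = (score, bar) => score > bar; // the comparison, flipped on purpose

// A "test" here is a function that returns true when it passes.
type Test = (impl: Check) => boolean;

// Runs the code, checks nothing: this is the empty test.
const emptyTest: Test = (impl) => {
  impl(80, 70);
  return true;
};

// Pins the behavior at the boundary, so the mutant cannot hide.
const boundaryTest: Test = (impl) => impl(70, 70) === true && impl(69, 70) === false;

// A mutant survives when a test that passes on the original also passes on the mutant.
function mutantSurvives(test: Test): boolean {
  return test(original) && test(mutant);
}

console.log("empty test lets mutant survive:", mutantSurvives(emptyTest));
console.log("boundary test lets mutant survive:", mutantSurvives(boundaryTest));
```

Both tests execute the same line, so both count toward coverage. Only the second one would notice if the comparison were wrong.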
Two kinds of review: one for documents, one for code
This is the core of the whole thing. After the automatic gate passes, a person-level review begins. It splits in two, because documents and code fail in different ways.
A document can state something false and still sound sure of itself. So for a document, we check every claim against real sources first, and only then argue about whether the design is good. Code is different. It is concrete, so you can follow exactly what it does, step by step. So for code, we do that: is everything connected, can a record get stuck with no way forward, has the code drifted away from the spec. Both reviews end the same way. One chair agent removes duplicate findings and writes a single verdict, instead of three reviewers talking over each other.
The code review has one strict rule that removes all "is this a real bug?" debate. A finding only counts if the reviewer can show the exact path to the broken state, written as a trace from one file:line to another. No path, no finding. With code you can always show the path, so showing the path is the proof. The chair drops any finding that does not have one.
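The "no path, no finding" rule is mechanical enough to sketch. This assumes a finding carries its trace as a list of file:line steps and that two findings with the same trace are duplicates. The Finding shape and the functions hasPath and chairVerdict are hypothetical, not the real chair agent.

```typescript
// Hypothetical finding shape: the trace is a list of "file:line" steps.
interface Finding {
  reviewer: string;
  summary: string;
  trace: string[];
}

const FILE_LINE = /^[^\s:]+:\d+$/;

// A path needs at least a start and an end, each a concrete file:line.
function hasPath(finding: Finding): boolean {
  return finding.trace.length >= 2 && finding.trace.every((step) => FILE_LINE.test(step));
}

// Drop findings with no path, then merge duplicates that share the same trace.
function chairVerdict(findings: Finding[]): Finding[] {
  const seen = new Set<string>();
  const kept: Finding[] = [];
  for (const finding of findings) {
    if (!hasPath(finding)) continue;
    const key = finding.trace.join(" -> ");
    if (seen.has(key)) continue;
    seen.add(key);
    kept.push(finding);
  }
  return kept;
}
```

A finding such as "this feels fragile" with an empty trace is dropped before anyone has to argue about it.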
The reviewers
The reviewers are small AI agents, each with one narrow job. Each one is a sub-agent: a separate AI instance spun up for that job, with its own instructions and its own limited set of tools. Each looks at exactly one thing, so they do not overlap. One chair agent combines their findings at the end, whichever set has run.
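The five document reviewers, each checking one thing, are Ledger, Atlas, Greybeard, Canon, and Pedant. Ledger grounds every claim against a real source.
The three code reviewers are static-composition, lifecycle-reachability, and invariant-canon. Between them they cover the three code questions above: is everything connected, can a record get stuck, and has the code drifted from the spec.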
Giving each reviewer one job is on purpose. Separate reviewers that each hunt for a different kind of problem find more together than one big reviewer trying to find everything. The document roster is even built to disagree. Two of them (Canon, Pedant) check that the document follows decisions we already made. Two others (Atlas, Greybeard) ask whether those decisions were right in the first place. When the two sides disagree, that disagreement is the most important finding, not noise. It means the decision is still open, and someone has to defend it or replace it.
One rule keeps every reviewer honest, and I learned it the hard way. A reviewer can read and report, but it can never change anything. Early on, a review agent tried to be helpful, ran a command it should not have, and scrambled the record of which database changes had been applied. Nothing was lost that time. But the lesson stuck. A reviewer that can change the thing it is reviewing is not a reviewer. It is a second author working with no one watching.
The review loop
A review is not done in one pass. When a reviewer finds a problem, the fix is new code that no one has reviewed yet, written right into the spot that was just flagged. So we review the fix too. If that review finds something, we fix it and review again. This repeats until a review comes back clean.
The real question is who decides to go around again. It used to be the AI, and that was the weak point. An AI grading its own work tends to stop too early and call it good enough. So we took that decision away from it and gave it to code.
Here is how that works. After each review, the result is written down: clean, or findings. If the last result was findings, the AI is not allowed to stop and move on. A small piece of code sees the open findings and sends it straight back for another round: fix, re-verify, re-review. The AI does not choose to loop. The loop is run for it, and it only ends when a round comes back clean.
A loop like this must not run forever, so code ends it in three ways, and none of them is a judgment call. A clean round ends it normally. Five open rounds also end it and hand the list to a person, because a fix that needs five rounds points to a design problem, not a bad round. And a loop left untouched for eight hours is handed to a person too, so a session is never stuck.
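The three exits can be written as one pure function. This is a sketch of the rule as described, not the contents of the real loop script. The LoopState shape and nextStep are hypothetical, and "five open rounds" is read here as five recorded rounds that ended with findings.

```typescript
// Hypothetical loop state, as it might be recorded after each review round.
type RoundResult = "clean" | "findings";

interface LoopState {
  rounds: RoundResult[];  // one entry per completed review round
  lastActivityMs: number; // timestamp of the last recorded activity
}

type NextStep = "finish" | "run-another-round" | "hand-to-person";

const MAX_OPEN_ROUNDS = 5;
const MAX_IDLE_MS = 8 * 60 * 60 * 1000; // eight hours

function nextStep(state: LoopState, nowMs: number): NextStep {
  const last = state.rounds[state.rounds.length - 1];
  if (last === "clean") return "finish";

  // A loop nobody has touched for eight hours is dropped to a person.
  if (nowMs - state.lastActivityMs >= MAX_IDLE_MS) return "hand-to-person";

  // Assumption: "open rounds" means rounds that ended with findings.
  const openRounds = state.rounds.filter((r) => r === "findings").length;
  if (openRounds >= MAX_OPEN_ROUNDS) return "hand-to-person";

  // Otherwise the agent is sent back: fix, re-verify, re-review.
  return "run-another-round";
}
```

Nothing in the function asks the agent whether the work is good enough. The only inputs are the recorded results and the clock.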
One more lock sits at the very end. Even after a clean round, you cannot finish the change if the code changed afterward. The archive step is blocked unless the last review was clean and the code is exactly what that review saw. Any edit after a clean review, a fix included, counts as new unreviewed code, so it forces one more round. You cannot slip a last-second change past the review.
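The exit lock reduces to comparing what the last review saw with what is on disk now. In this sketch the fingerprint is a plain string (for example a content hash computed elsewhere). The ReviewRecord shape and canArchive are hypothetical names, not the real gate.

```typescript
// Hypothetical record of the most recent review round.
interface ReviewRecord {
  result: "clean" | "findings";
  codeHash: string; // fingerprint of the code that round actually reviewed
}

// Archive is allowed only if the last review was clean AND the code is unchanged since.
function canArchive(lastReview: ReviewRecord | undefined, currentCodeHash: string): boolean {
  return (
    lastReview !== undefined &&
    lastReview.result === "clean" &&
    lastReview.codeHash === currentCodeHash
  );
}
```

Any edit after a clean round changes the fingerprint, so the check fails and one more round is required.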
Then, after the change is pushed, one final safety net. CI (continuous integration, an automated service that re-runs the full check suite on every push and blocks the merge if anything fails) runs every check again, so nothing gets in by hand.
So running the loop is now code's job. Code sends it into the next round, code ends it, and code locks the exit. What is left for a person is the judgment inside each review: is this finding a real problem or just a nit? The loop around that judgment runs on its own.
Three worked examples: what you actually type
The two lanes are easier to see as real sessions. Here are three: two in the default lane (one that records no decision, one that does) and one in the architecture lane. Each step below is a command you run and the artifact it produces. The product in these examples is the outreach tool I am building: it finds people, scores each one against an ideal customer profile, and drafts a first message for the ones that clear the bar.
Default lane, no decision: a bug fix
The score bar uses "greater than" where it should use "greater than or equal to," so a person who scores exactly at the bar gets dropped. Small, easy to undo, no lasting decision.
1. /opsx:propose fix score-bar comparison - the AI creates the change and its documents. → openspec/changes/fix-score-bar/ with proposal.md (what and why), design.md (a short note), tasks.md (the checklist), and a spec delta under specs/qualification/.
2. You read the spec. It is a few lines and it matches what you meant. Nothing permanent to approve.
3. /opsx:apply - the AI makes the code change and ticks the task off. → edits the source, marks the task - [x] in tasks.md.
4. npm run verify - the whole-codebase gate runs. → all seven checks green.
5. /code-review - a review agent reads the change together with the code around it, and you make the final call. → nothing that matters, so nothing to fix. (This change records no decision, so the ADR step writes nothing and there is no ADR gate. The conformance check /opsx:verify is optional here too, since there is no design to conform to.)
6. /opsx:archive → runs openspec archive fix-score-bar --yes, which merges the spec delta into openspec/specs/ and validates it.
7. git push → CI runs the full gate again. Green. Done.
Seven steps, one human read, no permanent decision.
Default lane, with a decision: a retry policy
The enrichment step sometimes fails, and you need to decide how many times to retry before giving up. This is a real, lasting decision, but it fits in one decision record plus code. It needs no diagrams, and it overturns nothing. Same lane as the bug fix, same commands.
1. /opsx:propose enrichment retry policy - the same command as the bug fix, on the same default lane. But this change records a real decision, so the AI's ADR step writes one this time instead of skipping it. → the change folder with proposal.md, design.md, an ADR ("retry three times, then park the person for review"), tasks.md, and a spec delta.
2. /verify-gate on the ADR - a new ADR was written, so the gate fires. Because this ADR overturns nothing, it runs the light check: the ledger agent checks the ADR's claims, and the canon reviewer checks it against past decisions. → a short verification record.
3. You sign off on the ADR. This step is yours and cannot be skipped, because the ADR is about to become permanent.
4. /opsx:apply → the AI writes the retry code and ticks the tasks.
5. npm run verify → green.
6. /opsx:verify - the conformance check: does the code actually match the design and the ADR? Worth running here because there is now an ADR to conform to. → clean.
7. /code-review → you make the final call.
8. /opsx:archive → merges the spec delta, and promotes the ADR to docs/adr/NNNN-enrichment-retry-policy.md, where it is now permanent.
9. git push → CI green.
Same lane and same commands as the bug fix. The only difference is that a decision surfaced, so an ADR appeared, which added the light gate and your sign-off. Nothing else changed.
Architecture lane: a new stage
You are adding the enrichment stage for the first time: a new step in the pipeline, a new state a person can be in, and a new boundary that later features will depend on. This shapes the system, so it takes the architecture lane.
1. /opsx:explore enrichment stage - you think the problem through with the AI first. No code, no decision yet. → docs/explore/2026-07-02-enrichment-stage.md, the research note.
2. /opsx:propose enrichment stage --schema spec-driven-architecture - the architecture lane, so you pass the schema. The AI writes the full architecture set, and tags every claim in it as a fact, a follows-from, or a fresh decision. → proposal.md, use-cases.md, domain-model.md, system-design.md (the C4 diagrams), deployment.md, an ADR, and tasks.md.
3. /verify-gate system-design (and the other artifacts) - the full panel runs: ledger grounds every claim, then Atlas, Greybeard, Pedant, and Canon attack the design, and the chair writes one verdict. → openspec/changes/enrichment-stage/verification.md, recording the result and a hash of each artifact.
4. You sign off. Mandatory. Nothing becomes permanent without it.
5. /opsx:apply - apply first confirms the gate passed, then it promotes the design to canon: it strips the hidden tags and publishes the clean diagrams into docs/architecture/, and the ADR lands in docs/adr/. Then it writes the code for the new stage. → published canon docs, then edited source with tasks ticked off.
6. npm run verify → green.
7. /opsx:verify - the conformance check, always run in the architecture lane.
8. /system-review - because this added a new stage that meets the existing ones, the three code lenses run over the seams: is everything wired, can a person get stuck, has the code drifted from the spec. The chair writes one verdict. → docs/reviews/2026-07-02-system-review-enrichment.md.
9. You drive the fix loop. The fix stays inside this same change: you point the agent at each finding and it edits the flagged code directly, still against the same spec. There is no new /opsx:propose for a fix. Because a fix is new code no one has reviewed, you re-run npm run verify and the review on it, and you repeat until a pass finds nothing that matters. You trigger each pass yourself. None of this is automatic.
10. /opsx:archive → merges the spec delta and validates.
11. git push → CI green. This is the one step that runs by itself.
Notice what grows down the three lists. The first two are the same lane and the same commands. The only thing that changed is that a decision surfaced in the second, so an ADR and its light gate appeared. The architecture lane adds a research note, a full set of diagrams, a design panel, a promotion to permanent docs, and a whole-system review. The spine is the same every time. The heavy machinery only comes out when the change earns it.
The work left for people keeps shrinking
The review panel is meant to shrink itself over time. Some findings can be turned into an automatic check: a forbidden connection between two modules, a link pointing the wrong way, a secret that could leak to the browser, a record that can reach a state with no way out. Each of those gets closed by writing a new build check [1], not by finding it again next time. After that, a machine catches that kind of bug forever, and no person ever has to look for it again.
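As one example of a finding turned into a check, "a record that can reach a state with no way out" is a graph search over the declared state model. This is a sketch under assumptions: the state names and the function stuckStates are hypothetical, and it only checks the model as declared, not how the shipped code actually behaves.

```typescript
// Declared lifecycle: each state maps to the states it may move to.
type Transitions = Record<string, string[]>;

// Returns every state reachable from `start` that has no exit and is not meant to be final.
function stuckStates(start: string, transitions: Transitions, terminal: Set<string>): string[] {
  const seen = new Set<string>([start]);
  const queue: string[] = [start];
  const stuck: string[] = [];

  while (queue.length > 0) {
    const state = queue.shift() as string;
    const next = transitions[state] ?? [];
    if (next.length === 0 && !terminal.has(state)) stuck.push(state);
    for (const candidate of next) {
      if (!seen.has(candidate)) {
        seen.add(candidate);
        queue.push(candidate);
      }
    }
  }
  return stuck;
}

// Hypothetical lifecycle for a person in the pipeline.
const lifecycle: Transitions = {
  found: ["scored"],
  scored: ["enriching", "rejected"],
  enriching: ["drafted", "enrich_failed"],
  enrich_failed: [], // no retry and no park-for-review: a dead end
  drafted: [],
  rejected: [],
};

const deadEnds = stuckStates("found", lifecycle, new Set(["drafted", "rejected"]));
console.log(deadEnds);
```

Run in the build, a non-empty result would fail the check, and the same class of bug would not need to be found by a reviewer again.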
What is left for a person is the judgment that no check can capture. Can a record get stuck in one of its states? (Its states are the set of states a record can be in and the moves allowed between them. A reachable state with no way out is a stuck record.) Judge that against the real parts we shipped, not just what their interfaces promise. Has the built system drifted away from what it was designed to be? These are not things a simple rule can decide. If a review at a milestone only finds problems that could have been automatic checks, the right lesson is that they should have been checks, and the fix is to write them.
There is one more failure to watch for. An AI told to make a check pass will sometimes make the check pass instead of fixing the real problem the check is there to catch. A failing test gets quietly weakened until it goes green. A blocking rule gets an exception added next to it. This is the deeper reason reviewers can only read and never write, and why a code finding needs a real path to a real bug before it counts. If the agent can change the guardrail itself, it is not a guardrail. The checks have to be things the agent cannot quietly get around.
The whole thing in one sentence
Write the spec first. Match the process to what is at stake. Let machines check what machines can check. Save people for the judgment that is left. Then keep making that leftover smaller.
That is the harness. The product is built on AI, and so is the way I build it. The difference between reckless speed and safe speed is not how fast you go. It is the guardrails you put around it.
Appendix: the setup on disk
All of this is just files in the repo. Below is the layout of the harness. Comments that start with example: mark project content that fills the framework, a spec or a decision or code; everything else is the framework itself. A … marks a folder that keeps accumulating instances; a folder shown without one is a fixed, complete set.
wisery-crm/
├── CLAUDE.md  # the rules the AI reads every session: stack, definition of done, the two lanes
├── package.json  # defines "npm run verify" (the gate) and the review-loop scripts
├── .dependency-cruiser.cjs  # the architecture boundaries, written as build-failing rules
├── .jscpd.json  # duplication check: fail on any real copy-paste
├── vitest.config.ts  # tests + the per-file coverage floor
├── stryker.config.mjs  # mutation testing: finds tests that check nothing
├── .github/workflows/ci.yml  # CI: re-runs the whole gate on every push
│
├── .claude/  # everything that drives the AI agents
│   ├── settings.json  # the hooks: the loop motor (Stop) and the archive gate (PreToolUse)
│   ├── agents/  # one file per reviewer sub-agent: its brief and its single lens
│   │   ├── ledger.md  # grounds every claim against a real source
│   │   ├── lifecycle-reachability.md  # can a record get stuck with no way out?
│   │   ├── chair.md  # merges the findings into one verdict
│   │   └── …  # atlas, greybeard, canon, pedant, static-composition, invariant-canon
│   ├── commands/  # the slash commands
│   │   ├── opsx/  # the spec workflow: propose, apply, verify, archive, explore
│   │   ├── verify-gate.md  # the document gate
│   │   └── system-review.md  # the whole-system code review
│   └── skills/  # reusable agent skills
│
├── scripts/  # the deterministic loop control, called by the .claude/ hooks
│   ├── review-loop.mjs  # record a round, drive the next one (Stop hook), lock the exit (archive gate)
│   └── mutation-survivors.mjs  # turn Stryker's output into a plain fix list
│
├── openspec/  # the spec workflow's home
│   ├── config.yaml  # the default lane (schema) + rules injected into every artifact
│   ├── schemas/  # the two lanes, as artifact templates
│   │   ├── spec-driven-with-adr/  # default lane: proposal, spec, design, adr, tasks
│   │   └── spec-driven-architecture/  # architecture lane: + use-cases, C4 views, deployment
│   ├── specs/  # the canonical specs: what each capability must do, one folder each
│   │   ├── enrichment/spec.md  # example: one capability's contract
│   │   └── …  # one folder per capability
│   └── changes/  # one folder per change; while in flight it also holds review-state.json (the loop ledger)
│       └── archive/  # finished changes, kept for history. Two kinds:
│           ├── 2026-05-25-project-backbone/  # example: a CODE change (default lane)
│           │   ├── specs/  # the behavior deltas this change makes
│           │   │   ├── platform-runtime/spec.md
│           │   │   └── background-jobs/spec.md
│           │   ├── .openspec.yaml  # which lane/schema + date
│           │   ├── proposal.md  # why + what changes
│           │   ├── design.md  # how (reuse-aware)
│           │   └── tasks.md  # the checklist the AI implemented
│           └── 2026-05-24-c4-level2-architecture/  # example: an ARCHITECTURE change
│               ├── adr/  # decision record(s), before promotion to docs/adr/
│               │   ├── 0003-llm-provider-port.md
│               │   └── 0004-pg-boss-facade.md
│               ├── .openspec.yaml  # which lane/schema + date
│               ├── proposal.md  # why + what changes
│               ├── use-cases.md  # who does what with it
│               ├── domain-model.md  # entities + the states they move through
│               ├── system-design.md  # the C4 views
│               ├── deployment.md  # how it runs
│               ├── tasks.md  # the checklist the AI implemented
│               └── verification.md  # the verify-gate's record + a hash per artifact
│
├── docs/  # the canon: the durable, human-facing record
│   ├── README.md  # how the docs are organised
│   ├── product-overview.md  # the north-star product spine (fixed sections, your content)
│   ├── roadmap.md  # build sequence + verification posture (fixed slot, your content)
│   ├── process/  # how we build: the harness, spelled out
│   │   ├── engineering.md  # the quality harness: the gate, testing, the review loop
│   │   ├── verification-gate.md  # the document gate
│   │   ├── system-review.md  # the whole-system code review
│   │   ├── review-panel.md  # the reviewer roster
│   │   └── module-conventions.md  # the layering + naming rules the boundaries enforce
│   ├── architecture/  # the clean, published canon: the architecture lane writes this fixed set, filled for your system
│   │   ├── README.md  # how to read the canon + its rules + the integrity check
│   │   ├── system-context.md  # C4 level 1: the system in its world
│   │   ├── system-design.md  # C4 levels 2-3: containers and components
│   │   ├── domain-model.md  # entities, their lifecycle states, and domain events
│   │   ├── cross-cutting.md  # concerns that span containers
│   │   ├── glossary.md  # the shared vocabulary
│   │   └── canon.manifest.json  # machine-checked contract: which canon docs + sections must exist
│   ├── adr/  # decision records: immutable, numbered, superseded not edited
│   │   ├── 0001-background-job-runtime.md  # example: one decision + its reasoning
│   │   ├── 0003-llm-provider-port.md  # example: another
│   │   └── …  # one per decision, and counting
│   ├── explore/  # the research behind decisions, kept apart from the decisions
│   │   ├── README.md  # how to write an explore note
│   │   ├── 2026-07-02-review-loop-determinism.md  # example: one investigation
│   │   └── …  # one per investigation, dated
│   └── reviews/  # system-review records
│       ├── 2026-06-04-system-review-milestone.md  # example: one review's findings
│       └── …  # one per review, dated
│
└── src/  # example: the product itself, not part of the harness
          # Next.js UI + vertical capability slices under src/lib,
          # each a port with adapters wired at one composition point
Read top to bottom, the framework is everything above src/: the gate and CI configs, then the agents and hooks that run the reviews and the loop, then the spec workflow and its two lanes, then the canon. Below src/ is the product the whole thing protects.
Footnotes
1. dependency-cruiser, architectural fitness functions for JavaScript and TypeScript. https://github.com/sverweij/dependency-cruiser
2. jscpd, copy-paste and duplication detector. https://github.com/kucherenko/jscpd
3. Stryker Mutator, mutation testing for JavaScript and TypeScript. https://stryker-mutator.io