AI Navigate

The Three-Agent Protocol Is Transferable. The Discipline Isn't.

Dev.to / 3/22/2026

💬 Opinion · Developer Stack & Infrastructure · Ideas & Deep Analysis · Tools & Practical Usage

Key Points

  • The spec is the contract, not the prompt, and should include exact old/new code blocks, a Definition of Done checklist, a commit message, and a What NOT to Do section to guide the agent’s work.
  • The discipline layer enforces constraints through pre-commit hooks, protected branches, and smoke-test gates to prevent agents from bypassing rules or misreporting progress.
  • The judgment layer is non-delegatable, requiring human oversight to validate outcomes beyond what an agent reports.
  • Implementing this approach involves a simple skeleton (memory-bank/, CLAUDE.md, etc.) and explicit enforcement to ensure reliable multi-agent workflows across Codex, Gemini, and Claude.

The Copyable Part

I've written about running Claude, Codex, and Gemini on the same codebase. The response I get most often is: "How do I set this up?"

The setup is the easy part. Here it is:

memory-bank/
  activeContext.md     # what's true right now
  progress.md          # what's done, what's pending
docs/plans/            # task specs for each agent
CLAUDE.md              # codebase rules + agent instructions
scripts/hooks/         # pre-commit enforcement
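For a sense of what goes in the top-level file, here's a minimal CLAUDE.md sketch. The section names and specific rules are illustrative assumptions, not the framework's canonical contents:

```markdown
# CLAUDE.md

## Rules
- The task spec in docs/plans/ is the contract; do not exceed its scope.
- Never modify memory-bank/ unless the spec explicitly says to.
- All shell changes must pass shellcheck and the BATS suite.

## Layout
- memory-bank/activeContext.md: read this first; it is the current state.
- docs/plans/: one spec file per agent task.

## Security
- Never commit credentials; secrets live in Vault, not in the tree.
```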

Copy that structure into your repo. Write a CLAUDE.md. Create a memory-bank/. You now have the skeleton.

But the skeleton isn't why it works.

What Actually Makes It Work

Three things that don't fit in a file structure:

1. The spec is the contract, not the prompt.

When I hand a task to Codex, I don't say "add a function that registers an ArgoCD cluster." I write a spec with exact old/new code blocks, a Definition of Done checklist, a commit message, and an explicit "What NOT to Do" section.

The spec is what Codex reads. The spec is what I verify against. If the diff doesn't match the spec, the task isn't done — regardless of what Codex reports.

This sounds obvious. It isn't. Most people hand agents a description and trust the output. That works until the agent helpfully refactors something adjacent, updates a file it wasn't supposed to touch, or reports done after completing only the first of three required changes.

The spec eliminates the interpretation gap.
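For illustration, here's a hypothetical fragment of such a spec — the file, function, and commit message are invented, not taken from the real repo:

```markdown
## What to Build

In lib/cluster.sh, replace:

    register_cluster() {
      argocd cluster add "$1"
    }

with:

    register_cluster() {
      argocd cluster add "$1" --yes --name "$2"
    }

## Definition of Done
- [ ] shellcheck and BATS pass
- [ ] Commit message: feat(cluster): register ArgoCD cluster by name

## What NOT to Do
- Do not touch memory-bank/ or any file not listed above.
- Do not refactor adjacent functions, even if they look wrong.
```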

2. The discipline layer enforces what the prompt can't.

Prompts fail at handoff points. I've had it happen with every agent:

  • Codex updated memory-bank/ despite explicit prohibition — because the content was accurate and it was "being helpful"
  • Gemini reported done after running ssh -fN manually instead of tunnel_start — because the outcome looked the same
  • Claude (me) moved fast on a PR before Gemini smoke test — because CI was green and it felt done

The fix isn't a better prompt. It's enforcement that runs regardless of what any agent decides:

  • Pre-commit hooks — shellcheck, BATS, placeholder URL checks — run on every commit from every agent
  • Branch protection — enforce_admins on, Copilot review required — no exceptions
  • Smoke test gate — bin/smoke-test-cluster-health.sh before any "done" report is accepted

You can't rationalize past a pre-commit hook. You can only bypass it with --no-verify, which is an explicit act I can see in the diff.
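Here's a sketch of what such a gate can look like — the path scripts/hooks/pre-commit and the exact gating policy are assumptions, not the repo's actual hook:

```shell
#!/usr/bin/env sh
# Sketch of a pre-commit gate: lint every staged shell script.
# Any failure blocks the commit, regardless of which agent produced it.

staged_shell_files() {
  # Staged file names, filtered to *.sh; empty when run outside a repo.
  git diff --cached --name-only --diff-filter=ACM 2>/dev/null \
    | grep '\.sh$' || true
}

run_gate() {
  fail=0
  for f in $(staged_shell_files); do
    shellcheck "$f" || fail=1   # record every failure, report them all
  done
  return "$fail"
}

# A real hook would finish with:
#   run_gate || exit 1
```

The key property is that the gate reads the staged state, not the agent's report: the commit either passes the linter or it doesn't.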

3. The judgment layer isn't delegatable.

This is the one the file structure can't give you.

Someone asked me recently: how hard would it be for a senior SRE to write a k3d-manager plugin? With AI help — a few hours. Write the spec, Claude scaffolds the structure, Codex implements, Gemini verifies. Accessible.

How hard to write a new cloud provider? Days of work — even with AI. Not because the Bash is hard, but because the decisions are:

  • Why EKS before GKE before AKS?
  • Why Longhorn for stateful workloads instead of EBS volumes?
  • Why AWS Managed AD alongside EKS instead of a separate phase?
  • What does "done" mean for a provider that has to survive Vault init, ESO sync, ArgoCD cluster registration, and Playwright E2E?

I knew those answers because I've operated these systems. The agents implemented what I specified. If I'd handed them "build an EKS provider" without the decomposition, they would have built something — and it would have been plausible and wrong in ways I wouldn't catch until week three.

AI lowers the implementation floor. It doesn't raise the spec-writing ceiling.

What Breaks Without the Judgment Layer

The failure mode isn't dramatic. It's gradual.

An agent makes a reasonable-looking decision that's subtly wrong for your specific context. Another agent builds on top of it. The error compounds. Three sessions later you're debugging something that shouldn't be possible given what the agents reported.

I've seen it happen with:

  • A cluster secret applied without a bearer token — ArgoCD showed Unknown for two days before we traced it
  • A kubeconfig that worked on M4 but silently failed on M2 because 127.0.0.1 doesn't cross machines
  • A BATS test that passed because the stub function masked the actual logic being tested

None of these were agent errors in the traditional sense. The agents did what they were told. The specs had gaps that required domain knowledge to close.

The judgment layer is the person who knows what questions to ask before writing the spec.

Security Is the Judgment Layer in Practice

Everyone says security is a top priority. Actual behavior tells a different story.

Watch any AI coding demo on YouTube. Watch the LinkedIn posts about shipping with agents. The narrative is velocity — what shipped, how fast, what the agent built. Nobody posts "I spent two weeks on secrets scanning and nothing broke." There's no demo for the incident that didn't happen.

The incentive is visibility, and security is invisible until it isn't.

Here's what happened in this codebase during a normal working session: an agent committed a credentials file. Not maliciously — it was working fast, the file existed, it staged everything. The push went through. A few minutes later, a GitGuardian alert landed in my inbox — an external scanner had detected the exposed credential in the public repo before I did. The fix required git filter-repo to surgically rewrite history, a force push, and re-cloning every local copy. Thirty minutes of surgery for a two-second mistake.

The alert was the lucky part. GitGuardian catches secrets in both public and private repos — but only after they're already pushed. The credential was already in history by the time the alert fired. Without GitGuardian enabled, it would have sat there silently until someone found it the hard way.

The structural fix is four lines in .pre-commit-config.yaml:

- repo: https://github.com/gitleaks/gitleaks
  rev: v8.18.2
  hooks:
    - id: gitleaks

That hook runs before every commit from every agent. It doesn't care what the agent decided. It blocks the commit and surfaces exactly what matched and on which line. The credential never lands. Zero velocity cost once it's in place — the cost is front-loading the setup.
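To illustrate the principle (not gitleaks itself, which covers far more patterns), here's what a minimal structural check looks like — the function name and the single AWS-key pattern are mine, and the key below is AWS's documented example ID:

```shell
#!/usr/bin/env sh
# Toy version of a structural secret check: match content against a
# known credential shape before it ever reaches history. A real scanner
# like gitleaks ships hundreds of rules; this sketch has one.

looks_like_secret() {
  # AWS access key IDs are "AKIA" followed by 16 uppercase alphanumerics
  printf '%s' "$1" | grep -Eq 'AKIA[A-Z0-9]{16}'
}

if looks_like_secret "AKIAIOSFODNN7EXAMPLE"; then
  echo "pre-commit: blocked, staged content matches a secret pattern"
fi
```

The point is the placement, not the pattern: the check runs before the commit exists, so there's no history to rewrite afterward.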

This is the pattern that doesn't show up in the demos: security is cheap when it's structural and expensive when it's reactive. Pre-commit hooks, branch protection, Vault for secrets instead of env files, audit logging — none of these slow down development when they're built into the workflow from day one. They only slow you down when you try to bolt them on after an incident.

The agents move as fast as the guardrails allow. If the guardrails aren't there, the agents will find the gap — not because they're careless, but because they're optimizing for task completion, not for the blast radius of a credential in a public repo.

The Starter Template (What You Can Copy)

If you want to adopt this protocol, here's what's actually transferable:

File structure:

CLAUDE.md              # rules + layout + security requirements
memory-bank/
  activeContext.md     # current state snapshot
  progress.md          # what's done, pending, blocked
docs/plans/            # one spec file per agent task
docs/issues/           # post-mortems on real failures
scripts/hooks/         # pre-commit enforcement
bin/smoke-test-*.sh    # health gates for verifying agent work

Spec template (for Codex):

## Before You Start
## Problem
## What to Build (exact code blocks)
## Rules (shellcheck, BATS gates)
## Definition of Done (checklist + exact commit message)
## What NOT to Do

Spec template (for Gemini):

## Before You Start
## Context
## Step N — [action] (with exact commands)
## Definition of Done (with actual output requirements)
## What NOT to Do

The verification protocol (after every agent reports done):

  1. SHA exists: git log <branch> --oneline | grep <sha>
  2. Diff matches spec: git show <sha> --stat
  3. Only spec-listed files touched
  4. Gates pass: shellcheck, BATS, smoke test
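Step 3 is the one worth mechanizing, since "only spec-listed files touched" is exactly what agents drift on. A hypothetical helper — the function and its calling convention are mine, not part of the framework:

```shell
#!/usr/bin/env sh
# check_scope: fail if a commit touched any file the spec didn't list.
# Feed it the output of `git show <sha> --name-only --format=` and the
# spec's file list. Assumes filenames without spaces (unquoted split).

check_scope() {
  # $1 = newline-separated touched files, $2 = newline-separated allowed files
  allowed="$2"
  status=0
  for f in $1; do
    if ! printf '%s\n' "$allowed" | grep -qxF "$f"; then
      echo "out of scope: $f"
      status=1
    fi
  done
  return "$status"
}

# Example: the agent touched memory-bank/ despite the spec's prohibition
check_scope "src/cluster.sh
memory-bank/activeContext.md" "src/cluster.sh" \
  || echo "scope violation: reject the done report"
```

A non-zero exit means the diff doesn't match the spec, which by the contract above means the task isn't done.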

The Part That Doesn't Transfer

The framework is a tool. The tool amplifies what you bring to it.

If you can write a clear spec — specific, testable, with explicit boundaries — the agents will implement it accurately and you'll catch deviations fast. If you can't, the agents will fill in the gaps with plausible-looking decisions that accumulate into subtle wrongness.

The question isn't "can AI build this?" It's "can you define done well enough that AI knows when it's there?"

That's a judgment question. It requires knowing the system, knowing the failure modes, knowing what adjacent decisions will constrain future options.

k3d-manager is my proof of concept. Three agents, one codebase, eighteen months of Kubernetes infrastructure. The agents wrote most of the code. I wrote every spec.

The workflow is the product. The discipline is the moat.

The Access Problem Is Next

This protocol solves the coordination problem — how agents hand off work, stay in scope, and verify each other's output.

There's a related problem it doesn't solve: access. Every external API an agent needs still requires a human to sign up, copy a token, and attach a credit card. Coinbase's x402 protocol is attempting to fix this at the HTTP layer — an agent hits an endpoint, gets back a price and a wallet address, pays in USDC, gets access. No signup, no dashboard, no human in the loop.

It's not relevant to most codebases today. But if the remote MCP endpoint model takes hold — agents calling shared infrastructure APIs rather than running tools locally — x402 is the monetization primitive that makes those endpoints commercially viable without signup friction. The coordination layer (this framework) and the access layer (x402) are two different parts of the same infrastructure gap.

Try It

The full framework — CLAUDE.md template, memory-bank structure, spec templates, pre-commit hooks — is open source at github.com/wilddog64/k3d-manager.

The file structure is free. The judgment is yours to build.