Most discussions around AI coding focus on how well models write code.
But in practice, many failures don’t come from bad code.
They come from bad plans.
The Problem
I’ve been using Claude Code and similar tools for real development tasks — data pipelines, Cloud Run jobs, API integrations.
One pattern kept showing up:
The model can implement a flawed plan very convincingly.
The code looks:
- clean
- complete
- structured
- “reasonable”
But later, the system fails because the original plan had issues like:
- hidden assumptions
- missing edge cases
- unclear rollback logic
- dependency failure scenarios
- undefined blast radius
At that point, fixing it is much more expensive.
A Small Experiment
I started trying a simple idea:
Before implementation, red-team the plan itself.
Instead of asking one model to “self-review,” I send the same plan to multiple models:
- Claude
- Codex
- Gemini
Each model reviews the plan independently.
Then I merge the findings into a single report and fix the plan before writing any code.
The Workflow
Here’s the full loop:
- Generate an implementation plan (Claude Code or similar)
- Send the plan to multiple models
- Each model reviews it independently (no shared context)
- Collect findings
- Merge and prioritize issues
- Fix the plan
- Only then start implementation
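The loop above can be sketched in a few lines of Python. The reviewer callables here are stand-ins for real API clients (Claude, Codex, Gemini); only the fan-out structure is the point:

```python
from concurrent.futures import ThreadPoolExecutor


def review_plan(plan: str, reviewers: dict) -> dict:
    """Send the same plan to each model independently (no shared context)
    and collect findings per model.

    `reviewers` maps a model name to a callable taking the plan text and
    returning a list of finding strings. A real version would wrap the
    actual model APIs; stubs are used here for illustration.
    """
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, plan) for name, fn in reviewers.items()}
    return {name: fut.result() for name, fut in futures.items()}


# Stub reviewers standing in for real API clients:
reviewers = {
    "claude": lambda plan: ["no rollback step"],
    "gemini": lambda plan: ["missing retry budget"],
}
findings = review_plan("...plan text...", reviewers)
```

Running the reviews in parallel matters less for speed than for independence: each model sees only the plan, never another model's critique.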
What to Look For
Each model is essentially trying to break the plan, looking for:
- hidden assumptions
- boundary conditions
- dependency failures
- misuse scenarios
- rollback / recovery gaps
- data consistency issues
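This checklist can be baked into a reusable review prompt so every model gets the same framing. The exact wording below is just one illustration, not the repo's canonical prompt:

```python
REVIEW_PROMPT = """You are red-teaming an implementation plan. Do not praise it.
Try to break it. For each issue, name the category and the concrete failure.

Categories:
- hidden assumptions
- boundary conditions
- dependency failures
- misuse scenarios
- rollback / recovery gaps
- data consistency issues

Plan:
{plan}
"""


def build_review_request(plan: str) -> str:
    """Fill the red-team template with the plan under review."""
    return REVIEW_PROMPT.format(plan=plan)
```

Telling the model explicitly not to praise the plan helps counter the default tendency to validate whatever it is shown.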
What I Learned
1. Different models catch different failure modes
This was the biggest surprise.
Each model has its own “bias” in what it notices.
2. The most valuable findings are often unique ones
Not the ones all models agree on, but the ones only one model catches.
Those usually represent blind spots the others missed.
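One way to surface those blind spots mechanically is to partition findings by how many models reported them. This is a sketch and assumes findings have been normalized to comparable strings first:

```python
def partition_findings(findings_by_model: dict) -> tuple:
    """Split merged findings into (unique, consensus).

    Unique issues were raised by exactly one model and often represent
    blind spots the other models missed; consensus issues were raised
    by two or more.
    """
    counts = {}
    for findings in findings_by_model.values():
        for f in dict.fromkeys(findings):  # dedupe within one model, keep order
            counts[f] = counts.get(f, 0) + 1
    unique = [f for f, n in counts.items() if n == 1]
    consensus = [f for f, n in counts.items() if n > 1]
    return unique, consensus


unique, consensus = partition_findings({
    "claude": ["no rollback step", "unbounded retries"],
    "codex": ["no rollback step"],
    "gemini": ["no rollback step", "schema drift unhandled"],
})
# unique: ["unbounded retries", "schema drift unhandled"]
# consensus: ["no rollback step"]
```

In practice the normalization step is the hard part, since different models phrase the same issue differently; that is where a merge pass by one model can help.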
3. This works better than self-critique
Asking a single model to review its own plan is useful, but limited.
Parallel independent review is much stronger.
A Real Pitfall I Hit
At one point, I asked Claude to generate a plan using a sub-agent.
It spun up a “plan agent” and started working.
Then nothing happened.
I just watched my usage climb… until it hit the 5-hour usage cap.
Zero output.
When I asked what happened, the answer was:
The output was too long and exceeded the response limit, so it kept retrying.
That was a key lesson:
Large outputs should never be returned as chat messages.
They should be written to files, and only summarized in the response.
Minimal Fix (CLAUDE.md)
I ended up adding this to my CLAUDE.md:
## Tool Usage
- Large outputs (>10KB) from subagents or external tools must be written to repo files.
- Do NOT return full content directly in chat.
- Messages should only include:
  - file path
  - summary
  - key findings
  - next steps
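The same rule is easy to enforce in plain code when a script, rather than the agent, is the one relaying tool output. A minimal sketch, with illustrative names and the 10KB threshold from the rule above:

```python
from pathlib import Path

LIMIT = 10 * 1024  # 10KB, matching the CLAUDE.md rule


def deliver(output: str, dest: Path) -> str:
    """Return small outputs directly; spill large ones to a file and
    return only the path plus a one-line summary."""
    if len(output.encode()) <= LIMIT:
        return output
    dest.write_text(output)
    lines = output.splitlines()
    first_line = lines[0] if lines else ""
    return f"Wrote {len(output)} chars to {dest}\nSummary: {first_line}"
```

Nothing clever here, but it makes the failure mode from the story above structurally impossible: the chat channel only ever carries a pointer and a summary.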
This Is Not a Framework
This approach is intentionally simple.
- no orchestration system
- no multi-agent framework
- no platform
Just a lightweight pattern:
Add structured doubt before execution.
When This Helps
This is especially useful for:
- data pipelines
- deployment workflows
- API integrations
- systems with rollback or failure cost
When It’s Overkill
You probably don’t need this for:
- quick scripts
- throwaway experiments
- low-risk code
I Wrote It Up
I put the workflow into a small repo here:
https://github.com/permoon/multi-model-redteam
It includes:
- a minimal setup
- red-team review patterns
- example cases
- a simple multi-model review script
Open Question
For people using AI coding tools:
Do you review your implementation plans before coding?
Or do you let the model start building right away?
I’m curious how others handle this.


