AI Is Very Good at Implementing Bad Plans

Dev.to / 5/2/2026


Key Points

  • The article argues that many AI coding failures stem less from incorrect code generation and more from flawed implementation plans that omit assumptions, edge cases, rollback logic, and failure-scenario coverage.
  • The author describes a workflow where an initial implementation plan is generated and then “red-teamed” by sending the same plan to multiple independent LLMs (e.g., Claude, Codex, Gemini) before any code is written.
  • Review findings are collected, merged, and prioritized, and the plan is revised until the major issues are addressed, reducing the cost of later fixes.
  • The author finds that different models surface different failure modes, and that the most valuable issues are often those discovered by only one model rather than issues all models agree on.
  • A practical pitfall is also mentioned: a sub-agent that generates a plan may stall while usage climbs, highlighting the need for monitoring and controls around agent-based planning loops.

Most discussions around AI coding focus on how well models write code.
But in practice, many failures don’t come from bad code.
They come from bad plans.

The Problem

I’ve been using Claude Code and similar tools for real development tasks — data pipelines, Cloud Run jobs, API integrations.

One pattern kept showing up:

The model can implement a flawed plan very convincingly.

The code looks:

  • clean
  • complete
  • structured
  • “reasonable”

But later, the system fails because the original plan had issues like:

  • hidden assumptions
  • missing edge cases
  • unclear rollback logic
  • dependency failure scenarios
  • undefined blast radius

At that point, fixing it is much more expensive.

A Small Experiment

I started trying a simple idea:

Before implementation, red-team the plan itself.

Instead of asking one model to “self-review,” I send the same plan to multiple models:

  • Claude
  • Codex
  • Gemini

Each model reviews the plan independently.

Then I merge the findings into a single report and fix the plan before writing any code.

The Workflow

Here’s the full loop:

  1. Generate an implementation plan (Claude Code or similar)
  2. Send the plan to multiple models
  3. Each model reviews it independently (no shared context)
  4. Collect findings
  5. Merge and prioritize issues
  6. Fix the plan
  7. Only then start implementation
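The loop above can be sketched as a small function. This is a minimal sketch, not the author's actual script: the reviewer callables are hypothetical stand-ins for real API clients (Claude, Codex, Gemini), each wrapped so it sees only the plan text and no shared context.

```python
from typing import Callable

def red_team_plan(
    plan: str,
    reviewers: dict[str, Callable[[str], list[str]]],
) -> dict[str, list[str]]:
    """Send the same plan to each reviewer independently (steps 2-4).

    Each reviewer is called with only the plan text -- no shared
    context between models -- and returns a list of findings.
    """
    findings: dict[str, list[str]] = {}
    for name, review in reviewers.items():
        findings[name] = review(plan)
    return findings

# Usage with stub reviewers standing in for real model calls:
stubs = {
    "claude": lambda plan: ["missing rollback step"],
    "gemini": lambda plan: ["unbounded retry loop", "missing rollback step"],
}
report = red_team_plan("Deploy the pipeline, then backfill.", stubs)
```

In a real setup, each stub would be replaced by an API call to the corresponding model; keeping the interface to "plan in, findings out" is what makes the reviews independent.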

What to Look For

Each model is essentially trying to break the plan:

  • hidden assumptions
  • boundary conditions
  • dependency failures
  • misuse scenarios
  • rollback / recovery gaps
  • data consistency issues
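One way to keep the reviews comparable is to send every model the same checklist-driven prompt. The wording below is an assumption for illustration, not the article's actual prompt:

```python
# Checklist mirrors the failure modes listed above.
REVIEW_CHECKLIST = [
    "hidden assumptions",
    "boundary conditions",
    "dependency failures",
    "misuse scenarios",
    "rollback / recovery gaps",
    "data consistency issues",
]

def build_review_prompt(plan: str) -> str:
    """Build one shared prompt so findings from different models line up.

    The phrasing here is illustrative -- any wording works as long as
    every model receives the identical prompt.
    """
    bullets = "\n".join(f"- {item}" for item in REVIEW_CHECKLIST)
    return (
        "Try to break the following implementation plan. "
        "List concrete findings under these categories:\n"
        f"{bullets}\n\nPLAN:\n{plan}"
    )
```
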

What I Learned

1. Different models catch different failure modes

This was the biggest surprise.

Each model has its own “bias” in what it notices.

2. The most valuable findings are often unique ones

Not the ones all models agree on.

But the ones only one model catches.

Those usually represent blind spots the others missed.
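Surfacing those single-model findings during the merge step can be automated. A minimal sketch, assuming findings have already been normalized to short strings (real findings are prose, so in practice you would cluster near-duplicates first):

```python
from collections import Counter

def unique_findings(
    findings_by_model: dict[str, list[str]],
) -> dict[str, list[str]]:
    """Return findings reported by exactly one model.

    These are the likely blind spots: issues every other
    reviewer missed. Assumes findings are normalized strings.
    """
    # Count each distinct finding once per model that reported it.
    counts = Counter(
        f for issues in findings_by_model.values() for f in set(issues)
    )
    return {
        model: [f for f in issues if counts[f] == 1]
        for model, issues in findings_by_model.items()
    }

report = {
    "claude": ["missing rollback", "no retry cap"],
    "codex": ["missing rollback"],
    "gemini": ["undefined blast radius"],
}
blind_spots = unique_findings(report)
# "missing rollback" is dropped (two models agree on it);
# the single-model findings remain.
```
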

3. This works better than self-critique

Asking a single model to review its own plan is useful, but limited.

Parallel independent review is much stronger.

A Real Pitfall I Hit

At one point, I asked Claude to generate a plan using a sub-agent.

It spun up a “plan agent” and started working.

Then nothing happened.

I just watched my usage climb… until it hit the 5-hour usage cap.

Zero output.

When I asked what happened, the answer was:

The output was too long and exceeded the response limit, so it kept retrying.

That was a key lesson:

Large outputs should never be returned as chat messages.

They should be written to files, and only summarized in the response.

Minimal Fix (CLAUDE.md)

I ended up adding this to my CLAUDE.md:

```markdown
## Tool Usage
- Large outputs (>10KB) from subagents or external tools must be written to repo files.
- Do NOT return full content directly in chat.
- Messages should only include:
  - file path
  - summary
  - key findings
  - next steps
```
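The same rule can be enforced mechanically in whatever tool layer hands sub-agent output back to the chat. A hypothetical helper (the function name and preview length are illustrative; the 10KB threshold matches the CLAUDE.md rule above):

```python
from pathlib import Path

LIMIT = 10 * 1024  # 10KB threshold from the CLAUDE.md rule

def deliver(output: str, path: str) -> str:
    """Return small outputs directly; spill large ones to a file.

    Large payloads are written to `path` and only a short summary
    goes back to the chat, so a response-size limit can never
    trigger a silent retry loop.
    """
    if len(output.encode()) <= LIMIT:
        return output
    Path(path).write_text(output)
    preview = output[:200]
    return (
        f"Full output written to {path} "
        f"({len(output)} bytes). Preview:\n{preview}"
    )
```
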

This Is Not a Framework

This approach is intentionally simple.

  • no orchestration system
  • no multi-agent framework
  • no platform

Just a lightweight pattern:

Add structured doubt before execution.

When This Helps

This is especially useful for:

  • data pipelines
  • deployment workflows
  • API integrations
  • systems with rollback or failure cost

When It’s Overkill

You probably don’t need this for:

  • quick scripts
  • throwaway experiments
  • low-risk code

I Wrote It Up

I put the workflow into a small repo here:

https://github.com/permoon/multi-model-redteam

It includes:

  • a minimal setup
  • red-team review patterns
  • example cases
  • a simple multi-model review script

Open Question

For people using AI coding tools:

Do you review your implementation plans before coding?

Or do you let the model start building right away?

I’m curious how others handle this.