Most discussions around AI coding focus on how well models write code.
But in practice, many failures don’t come from bad code.
They come from bad plans.
The Problem
I’ve been using Claude Code and similar tools for real development tasks — data pipelines, Cloud Run jobs, API integrations.
One pattern kept showing up:
The model can implement a flawed plan very convincingly.
The code looks:
- clean
- complete
- structured
- “reasonable”
But later, the system fails because the original plan had issues like:
- hidden assumptions
- missing edge cases
- unclear rollback logic
- dependency failure scenarios
- undefined blast radius
At that point, fixing it is much more expensive.
A Small Experiment
I started trying a simple idea:
Before implementation, red-team the plan itself.
Instead of asking one model to “self-review,” I send the same plan to multiple models:
- Claude
- Codex
- Gemini
Each model reviews the plan independently.
Then I merge the findings into a single report and fix the plan before writing any code.
The Workflow
Here’s the full loop:
- Generate an implementation plan (Claude Code or similar)
- Send the plan to multiple models
- Each model reviews it independently (no shared context)
- Collect findings
- Merge and prioritize issues
- Fix the plan
- Only then start implementation
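The loop above can be sketched in a few lines of Python. The reviewer callables here are stand-ins for real API clients (Claude, Codex, Gemini); only the fan-out structure is the point:

```python
from concurrent.futures import ThreadPoolExecutor


def review_plan(plan: str, reviewers: dict) -> dict:
    """Send the same plan to each model independently (no shared context)
    and collect findings per model.

    `reviewers` maps a model name to a callable taking the plan text and
    returning a list of finding strings. A real version would wrap the
    actual model APIs; stubs are used here for illustration.
    """
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, plan) for name, fn in reviewers.items()}
    return {name: fut.result() for name, fut in futures.items()}


# Stub reviewers standing in for real API clients:
reviewers = {
    "claude": lambda plan: ["no rollback step"],
    "gemini": lambda plan: ["missing retry budget"],
}
findings = review_plan("...plan text...", reviewers)
```

Running the reviews in parallel matters less for speed than for independence: each model sees only the plan, never another model's critique.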
What to Look For
Each model is essentially trying to break the plan, looking for:
- hidden assumptions
- boundary conditions
- dependency failures
- misuse scenarios
- rollback / recovery gaps
- data consistency issues
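This checklist can be baked into a reusable review prompt so every model gets the same framing. The exact wording below is just one illustration, not the repo's canonical prompt:

```python
REVIEW_PROMPT = """You are red-teaming an implementation plan. Do not praise it.
Try to break it. For each issue, name the category and the concrete failure.

Categories:
- hidden assumptions
- boundary conditions
- dependency failures
- misuse scenarios
- rollback / recovery gaps
- data consistency issues

Plan:
{plan}
"""


def build_review_request(plan: str) -> str:
    """Fill the red-team template with the plan under review."""
    return REVIEW_PROMPT.format(plan=plan)
```

Telling the model explicitly not to praise the plan helps counter the default tendency to validate whatever it is shown.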
What I Learned
1. Different models catch different failure modes
This was the biggest surprise.
Each model has its own “bias” in what it notices.
2. The most valuable findings are often unique ones
Not the ones all models agree on, but the ones only one model catches.
Those usually represent blind spots the others missed.
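One way to surface those blind spots mechanically is to partition findings by how many models reported them. This is a sketch and assumes findings have been normalized to comparable strings first:

```python
def partition_findings(findings_by_model: dict) -> tuple:
    """Split merged findings into (unique, consensus).

    Unique issues were raised by exactly one model and often represent
    blind spots the other models missed; consensus issues were raised
    by two or more.
    """
    counts = {}
    for findings in findings_by_model.values():
        for f in dict.fromkeys(findings):  # dedupe within one model, keep order
            counts[f] = counts.get(f, 0) + 1
    unique = [f for f, n in counts.items() if n == 1]
    consensus = [f for f, n in counts.items() if n > 1]
    return unique, consensus


unique, consensus = partition_findings({
    "claude": ["no rollback step", "unbounded retries"],
    "codex": ["no rollback step"],
    "gemini": ["no rollback step", "schema drift unhandled"],
})
# unique: ["unbounded retries", "schema drift unhandled"]
# consensus: ["no rollback step"]
```

In practice the normalization step is the hard part, since different models phrase the same issue differently; that is where a merge pass by one model can help.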
3. This works better than self-critique
Asking a single model to review its own plan is useful, but limited.
Parallel independent review is much stronger.
A Real Pitfall I Hit
At one point, I asked Claude to generate a plan using a sub-agent.
It spun up a “plan agent” and started working.
Then nothing happened.
I just watched my usage climb… until it hit the 5-hour usage cap.
Zero output.
When I asked what happened, the answer was:
The output was too long and exceeded the response limit, so it kept retrying.
That was a key lesson:
Large outputs should never be returned as chat messages.
They should be written to files, and only summarized in the response.
Minimal Fix (CLAUDE.md)
I ended up adding this to my CLAUDE.md:
## Tool Usage
- Large outputs (>10KB) from subagents or external tools must be written to repo files.
- Do NOT return full content directly in chat.
- Messages should only include:
  - file path
  - summary
  - key findings
  - next steps
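The same rule is easy to enforce in plain code when a script, rather than the agent, is the one relaying tool output. A minimal sketch, with illustrative names and the 10KB threshold from the rule above:

```python
from pathlib import Path

LIMIT = 10 * 1024  # 10KB, matching the CLAUDE.md rule


def deliver(output: str, dest: Path) -> str:
    """Return small outputs directly; spill large ones to a file and
    return only the path plus a one-line summary."""
    if len(output.encode()) <= LIMIT:
        return output
    dest.write_text(output)
    lines = output.splitlines()
    first_line = lines[0] if lines else ""
    return f"Wrote {len(output)} chars to {dest}\nSummary: {first_line}"
```

Nothing clever here, but it makes the failure mode from the story above structurally impossible: the chat channel only ever carries a pointer and a summary.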
This Is Not a Framework
This approach is intentionally simple.
- no orchestration system
- no multi-agent framework
- no platform
Just a lightweight pattern:
Add structured doubt before execution.
When This Helps
This is especially useful for:
- data pipelines
- deployment workflows
- API integrations
- systems with rollback or failure cost
When It’s Overkill
You probably don’t need this for:
- quick scripts
- throwaway experiments
- low-risk code
I Wrote It Up
I put the workflow into a small repo here:
https://github.com/permoon/multi-model-redteam
It includes:
- a minimal setup
- red-team review patterns
- example cases
- a simple multi-model review script
Open Question
For people using AI coding tools:
Do you review your implementation plans before coding?
Or do you let the model start building right away?
I’m curious how others handle this.


