My AI agents were individually correct and collectively a disaster

Dev.to / 3/25/2026


Key Points

  • The author argues that multi-agent systems fail not because individual agents are unreliable, but because they lack coordination mechanisms to prevent conflicting or disruptive changes.
  • A key example shows that security and SRE agents can each produce valid fixes while still proposing incompatible modifications due to missing cross-agent awareness.
  • The article distinguishes coordination from orchestration, noting that agents independently discover work and require a decision layer rather than mere task assignment.
  • The proposed solution, Nexus, acts as a gatekeeper executive layer with veto power: only Nexus can create tickets, and it decides whether work is right for the right time and reason.
  • Nexus improves outcomes via cross-agent synthesis (merging overlapping proposals into a single ticket with mandatory reviewers) and temporal judgment using live system state like CI health and strategic directives.

TL;DR: Multi-agent systems don't have an execution problem, they have a coordination problem. I built a gatekeeper layer called Nexus that sits above all other agents and is the only one that can create a ticket.

Repo: https://github.com/PermaShipAI/nexus

When I started building multi-agent systems for software engineering tasks, the architecture felt obvious. Create specialized agents for things like security, reliability, test coverage, and performance. Point them at a codebase and let them run.

The problem showed up fast.

The agents were individually correct. For example, the CISO agent found a real vulnerability and proposed a patch. The SRE agent identified the same affected component and proposed an architectural change that would eliminate the entire class of problem. Both proposals were valid, but neither agent knew the other existed. They would have shipped conflicting changes to the same files.

That's the easy version of the problem.

The harder version was agents that are locally optimal but globally disruptive. An agent proposes a dependency upgrade that is a good upgrade. But the CI pipeline is red, staging has a blocked circular dependency, and the CTO issued a directive hold on non-critical changes. The agent doesn't know any of that. It just sees a stale dependency.

I was not dealing with an agent quality problem. The agents were doing their jobs. I was dealing with a coordination problem. There was nobody to decide whether their jobs should be done.

This wasn't an orchestration problem. Orchestration assumes you know what needs doing and assigns it. These agents are discovering work independently.

The design decision: one agent with veto power

I built Nexus as an executive layer sitting above all other agents. The rule is simple. Only Nexus can create a ticket. Every other agent identifies work and makes its case. Nexus decides whether it's worth doing, at the right time, for the right reason.

That's the core question Nexus asks before anything enters the execution pipeline: Is this the right thing to do, at the right time, for the right reason?
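As a minimal sketch of that rule, here is what the gatekeeper pattern could look like. All names here are illustrative, not taken from the Nexus codebase: agents can only submit proposals, and the executive layer alone mints tickets.

```python
# Hypothetical sketch of the gatekeeper rule. Agents submit proposals;
# only Nexus appends to the ticket list.
from dataclasses import dataclass, field


@dataclass
class Proposal:
    agent: str          # originating agent, e.g. "ciso" or "sre"
    summary: str        # what the agent wants to do and why
    touches: set[str]   # files/components the change would modify


@dataclass
class Nexus:
    tickets: list = field(default_factory=list)

    def submit(self, proposal: Proposal) -> bool:
        """The only path into the execution pipeline.

        Asks: is this the right thing, at the right time,
        for the right reason?
        """
        if self._worth_doing(proposal):
            self.tickets.append(proposal)  # only Nexus creates tickets
            return True
        return False

    def _worth_doing(self, proposal: Proposal) -> bool:
        # Real policy lives here; this stub only checks that the
        # agent made any case at all.
        return bool(proposal.summary)


nexus = Nexus()
accepted = nexus.submit(
    Proposal("ciso", "patch auth bypass", {"auth/session.py"})
)
```

The point of the shape, not the stub policy: no agent holds a reference to the ticket store, so there is no way to ship work without passing through the decision layer.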

Nexus does a few things that make this work:

  • Cross-agent review. When the CISO agent and SRE agent both propose work touching the same component, Nexus doesn't just pick one. It synthesizes them, rejecting the narrower patch, merging the security requirements into the architectural ticket, and adding the CISO agent as a mandatory reviewer. One ticket, not two conflicting ones.

  • Temporal judgment. This one took the most work to get right. Nexus tracks system state: CI health, active incidents, error budgets, strategic directives. The same proposal that gets approved during normal operations gets deferred if you're in incident mitigation mode. Same proposal, different answer. Context matters more than correctness.

  • Rejection isn't binary. A proposal that fundamentally conflicts with core principles gets killed entirely. A proposal where the problem is valid but the execution plan is flawed gets kicked back to the originating agent with specific feedback to resubmit. No proposal is ever silently dropped.

  • Conflict detection and organizational memory. Agents tag the files, routes, and components their proposals touch. Nexus evaluates actual overlap, not just text similarity. And every approval, rejection, or modification feeds back into what Nexus knows about what your team values. It gets more accurate over time. Slowly, but it does.
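Two of the behaviors above can be sketched in a few lines. This is my own illustrative reconstruction, not the actual Nexus implementation: overlap is computed on the tagged file sets agents declare, and the same proposal gets a different answer depending on system state.

```python
# Illustrative sketch of conflict detection and temporal judgment.
# State fields and priority labels are assumptions for the example.
from dataclasses import dataclass


@dataclass
class SystemState:
    ci_green: bool
    active_incident: bool


def overlapping(touches_a: set[str], touches_b: set[str]) -> set[str]:
    # Actual overlap of declared files, not text similarity.
    return touches_a & touches_b


def decide(priority: str, state: SystemState) -> str:
    """Same proposal, different answer depending on context."""
    if state.active_incident and priority != "critical":
        return "defer"  # right thing, wrong time
    if not state.ci_green and priority == "routine":
        return "defer"  # don't pile changes onto a red pipeline
    return "approve"


# The CISO patch and the SRE refactor touch the same file -> one
# synthesized ticket, not two conflicting ones.
conflict = overlapping(
    {"auth/session.py"},
    {"auth/session.py", "auth/tokens.py"},
)

calm = decide("routine", SystemState(ci_green=True, active_incident=False))
firefight = decide("routine", SystemState(ci_green=True, active_incident=True))
```

In this toy version the routine proposal is approved during normal operations and deferred during an incident, which is the "context matters more than correctness" behavior in miniature.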

Every proposal submitted to Nexus must follow a Decision Brief format before anything moves:

- Problem statement (user harm / business risk)
- Evidence (metrics, incidents, frequency)
- Proposed change (what exactly)
- Alternatives considered
- Risks (security, reliability, correctness, UX)
- Dependencies / prerequisites
- Effort estimate (rough order-of-magnitude)
- Measurement plan (how success will be judged)
- Rollout / rollback plan
- Required reviewers (which agents must sign off)

No brief, no ticket.

Here's what a ticket looks like when a proposal passes:

Phase 1.3: Deterministic Offline Test for Publishing State Machine
pending
task · ux-designer · 3/21/2026, 11:36:32 PM

Write a deterministic offline test suite to verify the core publishing state
machine using the mock adapter.

Acceptance Criteria:
1. Offline Execution: The test suite must run completely offline without
   hitting any external social media networks.
2. Linear State Transitions: The test must explicitly assert the exact
   lifecycle of a publishing job, transitioning from Pending -> Publishing
   -> Success (or Failed).
3. Status Visibility (UX Guardrail): The test must prove that intermediate
   and final states are unambiguously persisted to the database. This
   guarantees the user-facing dashboard can always display an accurate,
   real-time system status (preventing 'ghost' or 'stuck' UI states).
4. Mock Integration: Successfully utilize the MockPlatformAdapter to
   deterministically trigger and verify both the happy path and the
   expected error paths.

Review Gates:
- QA Review: Must verify test determinism to prevent CI flakiness.
- UX/Product Review: Verify failure states contain enough context to render
  clear, actionable error messages in the UI.

Risks & Mitigations:
- Risk: Mock adapter behavior drifts from actual API reality.
- Mitigation: Keep mock logic intentionally dumb; map responses strictly
  to official platform API documentation.

Stop Conditions:
- Halt and escalate to humans if the state machine becomes deadlocked or
  orphaned, or if the mock adapter requires excessive complexity to simulate
  basic state transitions.

Fallback: If the mock adapter cannot accurately simulate all necessary state
transitions, fall back to a local HTTP stubbing tool (e.g., WireMock) to
simulate network-level responses against API contracts.

Why open source it

Honestly, the gatekeeper architecture is the part I'm most interested in getting feedback on. The multi-agent coordination problem is real, and most implementations I've seen punt on it entirely. I wanted to put the decision layer out in the open and see what people do with it. The more feedback it gets, the faster it will improve.

The repo is here: https://github.com/PermaShipAI/nexus

It runs locally and works with local models or the Anthropic, OpenAI, and Gemini APIs.

If you're building multi-agent systems and hitting the coordination wall, open an issue or drop a comment. Genuinely curious what edge cases people are running into.