Last night at AI Tinkerers, someone audited my multi-agent system in front of the room. Not a demo. Not a presentation. An actual architectural assessment using knowledge-graph analysis, scored against established maturity frameworks.
The system has 39 specialized agents across five categories, defined governance protocols, six workflow types with 8-17 steps each, and a dedicated evolution loop for continuous improvement. I've been building it for months.
## The Audit
Marcus Waldman ran his iConsult tool against the full architecture. The tool maps agent definitions, workflow structures, and coordination patterns into a knowledge graph, then scores them against patterns from Arsanjani and Bustos's work on agentic architectural patterns.
Here's what came back:
| Category | Rating |
|---|---|
| Coordination & Planning | Established |
| Explainability & Compliance | Emerging |
| Robustness & Fault Tolerance | Not Started |
| Human-Agent Interaction | Emerging |
| Agent-Level Capabilities | Not Started |
| System-Level Infrastructure | Not Started |
| Continuous Improvement | Emerging |
Failure chain coverage: 20%. One of five steps in the automated recovery chain existed. The rest were missing entirely.
A system with 39 agents, a three-tier supervisor hierarchy, and dedicated auditor and sentinel agents scored "Not Started" on robustness. That is the gap most teams aren't talking about.
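To make "20% failure chain coverage" concrete, here's a minimal sketch of what the audit was measuring. The step names are my own illustration, not the audit's actual taxonomy; the point is that only one handler out of five exists.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical five-step recovery chain; step names are invented for illustration.
@dataclass
class RecoveryStep:
    name: str
    handler: Optional[Callable[[Exception], bool]]  # None = not built yet

FAILURE_CHAIN = [
    RecoveryStep("detect", handler=lambda e: True),  # the one step that existed
    RecoveryStep("classify", handler=None),
    RecoveryStep("isolate", handler=None),
    RecoveryStep("retry_or_reroute", handler=None),
    RecoveryStep("escalate_to_human", handler=None),
]

coverage = sum(s.handler is not None for s in FAILURE_CHAIN) / len(FAILURE_CHAIN)
print(f"failure chain coverage: {coverage:.0%}")  # -> 20%
```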
## Arsanjani's 6 Levels of Agent Maturity
Ali Arsanjani (Google Cloud) published a maturity model that maps where agent systems actually fall on a capability spectrum. Most of us think we're higher than we are.
Level 0: No Agents. Traditional software. No autonomous components.
Level 1: Single Agent with Tools. One LLM with function calling. This is where most "agentic" products actually live. The agent can use tools but has no planning, no memory beyond the conversation, and no coordination with other agents.
Level 2: Multi-Agent Coordination. Multiple agents with defined roles and handoff patterns. A supervisor or router dispatches work. This is where the orchestration problem starts to bite (a minimal sketch of the pattern follows the level list).
Level 3: Autonomous Planning. Agents can decompose tasks, create plans, and execute them with minimal human oversight. The system handles multi-step workflows without constant prompting.
Level 4: Adaptive Systems. Agents learn from outcomes, adjust strategies, and improve over time. Self-evaluation loops. Performance metrics that feed back into behavior.
Level 5: Bureaucracy of Agents. Dedicated oversight agents. Auditors. Inspectors. Governance structures that exist specifically to monitor and evaluate other agents. This is the level that sounds like overkill until you realize it's the only way to maintain reliability at scale.
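The Level 1 to Level 2 jump is easier to see in code than in prose. Here's a minimal sketch of Level 2 handoffs, assuming a `call_llm` stub in place of a real model API; the roles and routing rules are invented for illustration. Level 1 collapses this to a single agent calling tools in a loop.

```python
# Minimal Level 2 sketch: a supervisor routes work to role-specific agents.
# `call_llm` is a stand-in for a real model call.
def call_llm(role: str, task: str) -> str:
    return f"[{role}] handled: {task}"

AGENTS = {
    "researcher": lambda task: call_llm("researcher", task),
    "writer": lambda task: call_llm("writer", task),
    "reviewer": lambda task: call_llm("reviewer", task),
}

def supervisor(task: str) -> str:
    # Level 2 is mostly this: explicit handoff rules between agents.
    draft = AGENTS["researcher"](task)
    draft = AGENTS["writer"](draft)
    return AGENTS["reviewer"](draft)

print(supervisor("summarize the audit findings"))
```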
My system has governance agents. It has an auditor, a sentinel, an evaluator, and a coherence checker. On paper, it touches Level 5. In practice, the audit showed the governance layer is partially built but the infrastructure underneath it (automated recovery, dynamic registry, event bus) doesn't exist yet.
You can have the org chart without the plumbing. The maturity model measures the plumbing.
## Why Majority Voting Fails
There's a related finding from the AgentAuditor paper (USC, February 2026) that connects directly to this maturity problem.
The standard approach to multi-agent reliability is majority voting. Run the same task through multiple agents, take the consensus answer. Sounds reasonable. It's also broken.
The problem is correlated bias. When agents share the same training data and similar reasoning patterns, they don't produce independent votes. They converge on the same wrong answer. Majority voting fails for the same reason groupthink fails in organizations: more voices don't help when they all share the same blind spots.
AgentAuditor's approach was to map reasoning trees and search for path divergences instead of counting votes. The result: 5% accuracy improvement over majority voting. Not because the individual agents were better, but because the auditing structure was better.
This is exactly the gap the audit exposed in my system. I have a sentinel and an auditor, but they're watching for rule violations, not reasoning divergences. The governance layer checks process. It doesn't check whether agents are converging on the same blind spot. That's a different kind of auditing entirely.
The lesson: you don't fix reliability by adding more agents. You fix it by adding structural auditing that can identify where reasoning paths diverge. It's a coordination architecture problem, not a scaling problem.
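Here's a rough sketch of the difference, not the AgentAuditor implementation. It contrasts counting votes with checking whether the agents that agree actually reasoned along different paths; the Jaccard similarity measure and the 0.8 threshold are placeholders.

```python
from collections import Counter
from itertools import combinations

# Each agent returns an answer plus the reasoning steps it took.
runs = [
    {"answer": "A", "steps": {"parse", "lookup", "sum"}},
    {"answer": "A", "steps": {"parse", "lookup", "sum"}},
    {"answer": "B", "steps": {"parse", "estimate"}},
]

# Majority voting: counts answers, blind to how they were reached.
majority = Counter(r["answer"] for r in runs).most_common(1)[0][0]

# Divergence check: how similar are the reasoning paths that agree?
def jaccard(a, b):
    return len(a & b) / len(a | b)

agreeing = [r for r in runs if r["answer"] == majority]
overlaps = [jaccard(x["steps"], y["steps"]) for x, y in combinations(agreeing, 2)]
correlated = overlaps and min(overlaps) > 0.8  # threshold is a placeholder

print(majority, "consensus looks correlated" if correlated else "paths diverge")
```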
## The Numbers Behind the Hype
Gartner reported a 1,445% surge in multi-agent inquiries. At the same time, they project 40% of agentic AI projects will be cancelled by 2027. Only about 130 out of thousands of vendors in the space are building real multi-agent capabilities.
Deloitte estimates the market at $8.5B in 2026, growing to $35-45B by 2030. But those numbers assume proper orchestration. Without it, you get the 40% cancellation rate.
The demand-reality gap isn't about model capability. GPT-4, Claude, Gemini can all handle complex reasoning. The bottleneck is orchestration maturity. How do you coordinate agents? How do you detect failures? How do you recover? How do you know your system is actually working as designed?
Most teams skip these questions because they're not as exciting as adding another agent.
## Self-Assessment
If you're building a multi-agent system, here are the questions worth asking:
What level are you actually at? Not what your architecture diagram suggests. What does the running system demonstrate?
Can your system detect its own failures? Not log them. Detect them in real time and route them to recovery logic.
How do you audit agent behavior? If the answer is "we read the logs," you're at Level 1 maturity for observability regardless of how many agents you have.
What happens when an agent produces wrong output? Does the system catch it? Or does it propagate through the pipeline?
Is your governance layer structural or decorative? Having an "auditor agent" in the config is different from having an auditor agent that actually interrupts workflows when quality drops.
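To make that last question concrete, here's a sketch of what "structural" could look like: an audit step that halts the workflow instead of just logging. The quality scorer and threshold are placeholders for whatever checks fit your domain.

```python
class QualityGateError(Exception):
    pass

# Placeholder scorer; a real one might check citations, schema validity,
# or agreement with a reference answer.
def quality_score(output: str) -> float:
    return 0.4 if "TODO" in output else 0.9

def audited_step(agent_fn, task: str, threshold: float = 0.7) -> str:
    output = agent_fn(task)
    if quality_score(output) < threshold:
        # Structural governance: the workflow stops here instead of
        # passing a bad intermediate result downstream.
        raise QualityGateError(f"output below {threshold}: {output!r}")
    return output
```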
I had to answer these questions publicly last night. That's the value of external assessment. Your own evaluation will always be generous.
## What I'm Doing About It
The audit produced a concrete implementation plan. Phase 1 is the robustness gap: circuit breakers, retry policies, health checks, and a failure chain that actually covers all five steps. The coordination score was reasonable because the supervisor architecture and workflow definitions are solid. But coordination without robustness is a system that works until it doesn't, and when it fails, there's nothing to catch it.
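For a sense of the shape of those Phase 1 primitives, here's a generic circuit breaker with retries around an agent call. This is a sketch, not my actual implementation; the failure counts, cooldowns, and backoff are placeholder numbers.

```python
import time

class CircuitBreaker:
    """Stops routing work to a failing agent, then lets it retry after a cooldown."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, agent_fn, task, retries: int = 2):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: agent pulled from rotation")
            self.opened_at = None  # cooldown elapsed, allow a probe call

        last_error = None
        for attempt in range(retries + 1):
            try:
                result = agent_fn(task)
                self.failures = 0
                return result
            except Exception as exc:  # a real system would narrow this
                last_error = exc
                self.failures += 1
                if self.failures >= self.max_failures:
                    self.opened_at = time.monotonic()
                    break
                time.sleep(0.5 * (attempt + 1))  # simple linear backoff
        raise last_error
```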
The maturity model isn't a checklist to complete. It's a map for knowing where you actually are and what to build next. The frameworks exist. The assessment tools are getting better. The question is whether you're willing to run the audit.
I build Sigil, an open-source symbolic computation framework, and write about systems architecture on Substack.