Last night at AI Tinkerers, someone audited my multi-agent system in front of the room. Not a demo. Not a presentation. An actual architectural assessment using knowledge-graph analysis, scored against established maturity frameworks.
The system has 39 specialized agents across five categories, defined governance protocols, six workflow types with 8-17 steps each, and a dedicated evolution loop for continuous improvement. I've been building it for months.
## The Audit
Marcus Waldman ran his iConsult tool against the full architecture. The tool maps agent definitions, workflow structures, and coordination patterns into a knowledge graph, then scores them against patterns from Arsanjani and Bustos's work on agentic architectural patterns.
Here's what came back:
| Category | Rating |
|---|---|
| Coordination & Planning | Established |
| Explainability & Compliance | Emerging |
| Robustness & Fault Tolerance | Not Started |
| Human-Agent Interaction | Emerging |
| Agent-Level Capabilities | Not Started |
| System-Level Infrastructure | Not Started |
| Continuous Improvement | Emerging |
Failure chain coverage: 20%. One of five steps in the automated recovery chain existed. The rest were missing entirely.
A system with 39 agents, a three-tier supervisor hierarchy, and dedicated auditor and sentinel agents scored "Not Started" on robustness. That is the gap most teams aren't talking about.
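To make "20% failure chain coverage" concrete, here's a minimal sketch of what the audit was measuring. The step names are my own illustration, not the audit's actual taxonomy; the point is that only one handler out of five exists.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical five-step recovery chain; step names are invented for illustration.
@dataclass
class RecoveryStep:
    name: str
    handler: Optional[Callable[[Exception], bool]]  # None = not built yet

FAILURE_CHAIN = [
    RecoveryStep("detect", handler=lambda e: True),  # the one step that existed
    RecoveryStep("classify", handler=None),
    RecoveryStep("isolate", handler=None),
    RecoveryStep("retry_or_reroute", handler=None),
    RecoveryStep("escalate_to_human", handler=None),
]

coverage = sum(s.handler is not None for s in FAILURE_CHAIN) / len(FAILURE_CHAIN)
print(f"failure chain coverage: {coverage:.0%}")  # -> 20%
```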
## Arsanjani's 6 Levels of Agent Maturity
Ali Arsanjani (Google Cloud) published a maturity model that maps where agent systems actually fall on a capability spectrum. Most of us think we're higher than we are.
Level 0: No Agents. Traditional software. No autonomous components.
Level 1: Single Agent with Tools. One LLM with function calling. This is where most "agentic" products actually live. The agent can use tools but has no planning, no memory beyond the conversation, and no coordination with other agents.
Level 2: Multi-Agent Coordination. Multiple agents with defined roles and handoff patterns. A supervisor or router dispatches work. This is where the orchestration problem starts to bite (a minimal sketch of the pattern follows the level list).
Level 3: Autonomous Planning. Agents can decompose tasks, create plans, and execute them with minimal human oversight. The system handles multi-step workflows without constant prompting.
Level 4: Adaptive Systems. Agents learn from outcomes, adjust strategies, and improve over time. Self-evaluation loops. Performance metrics that feed back into behavior.
Level 5: Bureaucracy of Agents. Dedicated oversight agents. Auditors. Inspectors. Governance structures that exist specifically to monitor and evaluate other agents. This is the level that sounds like overkill until you realize it's the only way to maintain reliability at scale.
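The Level 1 to Level 2 jump is easier to see in code than in prose. Here's a minimal sketch of Level 2 handoffs, assuming a `call_llm` stub in place of a real model API; the roles and routing rules are invented for illustration. Level 1 collapses this to a single agent calling tools in a loop.

```python
# Minimal Level 2 sketch: a supervisor routes work to role-specific agents.
# `call_llm` is a stand-in for a real model call.
def call_llm(role: str, task: str) -> str:
    return f"[{role}] handled: {task}"

AGENTS = {
    "researcher": lambda task: call_llm("researcher", task),
    "writer": lambda task: call_llm("writer", task),
    "reviewer": lambda task: call_llm("reviewer", task),
}

def supervisor(task: str) -> str:
    # Level 2 is mostly this: explicit handoff rules between agents.
    draft = AGENTS["researcher"](task)
    draft = AGENTS["writer"](draft)
    return AGENTS["reviewer"](draft)

print(supervisor("summarize the audit findings"))
```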
My system has governance agents. It has an auditor, a sentinel, an evaluator, and a coherence checker. On paper, it touches Level 5. In practice, the audit showed the governance layer is partially built but the infrastructure underneath it (automated recovery, dynamic registry, event bus) doesn't exist yet.
You can have the org chart without the plumbing. The maturity model measures the plumbing.
## Why Majority Voting Fails
There's a related finding from the AgentAuditor paper (USC, February 2026) that connects directly to this maturity problem.
The standard approach to multi-agent reliability is majority voting. Run the same task through multiple agents, take the consensus answer. Sounds reasonable. It's also broken.
The problem is correlated bias. When agents share the same training data and similar reasoning patterns, they don't produce independent votes. They converge on the same wrong answer. Majority voting fails for the same reason groupthink fails in organizations: more voices don't help when they all share the same blind spots.
AgentAuditor's approach was to map reasoning trees and search for path divergences instead of counting votes. The result: 5% accuracy improvement over majority voting. Not because the individual agents were better, but because the auditing structure was better.
This is exactly the gap the audit exposed in my system. I have a sentinel and an auditor, but they're watching for rule violations, not reasoning divergences. The governance layer checks process. It doesn't check whether agents are converging on the same blind spot. That's a different kind of auditing entirely.
The lesson: you don't fix reliability by adding more agents. You fix it by adding structural auditing that can identify where reasoning paths diverge. It's a coordination architecture problem, not a scaling problem.
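Here's a rough sketch of the difference, not the AgentAuditor implementation. It contrasts counting votes with checking whether the agents that agree actually reasoned along different paths; the Jaccard similarity measure and the 0.8 threshold are placeholders.

```python
from collections import Counter
from itertools import combinations

# Each agent returns an answer plus the reasoning steps it took.
runs = [
    {"answer": "A", "steps": {"parse", "lookup", "sum"}},
    {"answer": "A", "steps": {"parse", "lookup", "sum"}},
    {"answer": "B", "steps": {"parse", "estimate"}},
]

# Majority voting: counts answers, blind to how they were reached.
majority = Counter(r["answer"] for r in runs).most_common(1)[0][0]

# Divergence check: how similar are the reasoning paths that agree?
def jaccard(a, b):
    return len(a & b) / len(a | b)

agreeing = [r for r in runs if r["answer"] == majority]
overlaps = [jaccard(x["steps"], y["steps"]) for x, y in combinations(agreeing, 2)]
correlated = overlaps and min(overlaps) > 0.8  # threshold is a placeholder

print(majority, "consensus looks correlated" if correlated else "paths diverge")
```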
## The Numbers Behind the Hype
Gartner reported a 1,445% surge in multi-agent inquiries. At the same time, they project 40% of agentic AI projects will be cancelled by 2027. Only about 130 out of thousands of vendors in the space are building real multi-agent capabilities.
Deloitte estimates the market at $8.5B in 2026, growing to $35-45B by 2030. But those numbers assume proper orchestration. Without it, you get the 40% cancellation rate.
The demand-reality gap isn't about model capability. GPT-4, Claude, Gemini can all handle complex reasoning. The bottleneck is orchestration maturity. How do you coordinate agents? How do you detect failures? How do you recover? How do you know your system is actually working as designed?
Most teams skip these questions because they're not as exciting as adding another agent.
## Self-Assessment
If you're building a multi-agent system, here are the questions worth asking:
What level are you actually at? Not what your architecture diagram suggests. What does the running system demonstrate?
Can your system detect its own failures? Not log them. Detect them in real time and route them to recovery logic.
How do you audit agent behavior? If the answer is "we read the logs," you're at Level 1 maturity for observability regardless of how many agents you have.
What happens when an agent produces wrong output? Does the system catch it? Or does it propagate through the pipeline?
Is your governance layer structural or decorative? Having an "auditor agent" in the config is different from having an auditor agent that actually interrupts workflows when quality drops.
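To make that last question concrete, here's a sketch of what "structural" could look like: an audit step that halts the workflow instead of just logging. The quality scorer and threshold are placeholders for whatever checks fit your domain.

```python
class QualityGateError(Exception):
    pass

# Placeholder scorer; a real one might check citations, schema validity,
# or agreement with a reference answer.
def quality_score(output: str) -> float:
    return 0.4 if "TODO" in output else 0.9

def audited_step(agent_fn, task: str, threshold: float = 0.7) -> str:
    output = agent_fn(task)
    if quality_score(output) < threshold:
        # Structural governance: the workflow stops here instead of
        # passing a bad intermediate result downstream.
        raise QualityGateError(f"output below {threshold}: {output!r}")
    return output
```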
I had to answer these questions publicly last night. That's the value of external assessment. Your own evaluation will always be generous.
## What I'm Doing About It
The audit produced a concrete implementation plan. Phase 1 is the robustness gap: circuit breakers, retry policies, health checks, and a failure chain that actually covers all five steps. The coordination score was reasonable because the supervisor architecture and workflow definitions are solid. But coordination without robustness is a system that works until it doesn't, and when it fails, there's nothing to catch it.
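For a sense of the shape of those Phase 1 primitives, here's a generic circuit breaker with retries around an agent call. This is a sketch, not my actual implementation; the failure counts, cooldowns, and backoff are placeholder numbers.

```python
import time

class CircuitBreaker:
    """Stops routing work to a failing agent, then lets it retry after a cooldown."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, agent_fn, task, retries: int = 2):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: agent pulled from rotation")
            self.opened_at = None  # cooldown elapsed, allow a probe call

        last_error = None
        for attempt in range(retries + 1):
            try:
                result = agent_fn(task)
                self.failures = 0
                return result
            except Exception as exc:  # a real system would narrow this
                last_error = exc
                self.failures += 1
                if self.failures >= self.max_failures:
                    self.opened_at = time.monotonic()
                    break
                time.sleep(0.5 * (attempt + 1))  # simple linear backoff
        raise last_error
```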
The maturity model isn't a checklist to complete. It's a map for knowing where you actually are and what to build next. The frameworks exist. The assessment tools are getting better. The question is whether you're willing to run the audit.
I build Sigil, an open-source symbolic computation framework, and write about systems architecture on Substack.