CADMAS-CTX: Contextual Capability Calibration for Multi-Agent Delegation

arXiv cs.AI / 4/21/2026

Key Points

  • The paper revisits multi-agent delegation and argues that an agent’s effective capability varies with task context rather than remaining fixed as a static skill profile.
  • It introduces CADMAS-CTX, which maintains a hierarchical capability profile: a Beta posterior for each agent, skill, and coarse context bucket, capturing stable experience in that part of the task space.
  • Delegation decisions use a risk-aware score that combines the posterior mean with an uncertainty penalty, so an agent delegates only when a peer appears better and the evidence for that assessment is sufficient.
  • The authors provide theoretical guarantees via contextual bandit analysis, proving that context-aware routing achieves lower cumulative regret than static routing under sufficient context heterogeneity.
  • Experiments on GAIA and SWE-bench show gains over static routing (GAIA accuracy with GPT-4o agents: 0.442 vs. 0.381 for the static baseline; SWE-bench Lite resolve rate: 22.3% → 31.4%), and ablations show that the uncertainty penalty improves robustness to context-tagging noise.
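The posterior bookkeeping and risk-aware score described in the key points can be illustrated with a minimal Python sketch. It is not the authors' implementation. The summary does not give the exact form of the penalty, the prior, or the hierarchy, so the sketch assumes a uniform Beta(1, 1) prior, a flat (agent, skill, context) key, and a score equal to the posterior mean minus a weighted posterior standard deviation. The names `BetaPosterior`, `CapabilityProfile`, `penalty_weight`, and the agent and context labels are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from math import sqrt


@dataclass
class BetaPosterior:
    """Beta posterior over an agent's success rate in one context bucket."""
    alpha: float = 1.0  # assumed uniform prior: one pseudo-success
    beta: float = 1.0   # assumed uniform prior: one pseudo-failure

    def update(self, success: bool) -> None:
        # Conjugate update for a Bernoulli outcome.
        if success:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)

    @property
    def std(self) -> float:
        n = self.alpha + self.beta
        return sqrt(self.alpha * self.beta / (n * n * (n + 1.0)))


class CapabilityProfile:
    """One posterior per (agent, skill, context) triple."""

    def __init__(self, penalty_weight: float = 1.0) -> None:
        # penalty_weight is an assumed knob, not a parameter from the paper.
        self.penalty_weight = penalty_weight
        self._posteriors = defaultdict(BetaPosterior)

    def record(self, agent: str, skill: str, context: str, success: bool) -> None:
        self._posteriors[(agent, skill, context)].update(success)

    def score(self, agent: str, skill: str, context: str) -> float:
        # Assumed scoring rule: posterior mean minus an uncertainty penalty.
        p = self._posteriors[(agent, skill, context)]
        return p.mean - self.penalty_weight * p.std

    def should_delegate(self, me: str, peer: str, skill: str, context: str) -> bool:
        # Delegate only if the peer's penalized score beats our own.
        return self.score(peer, skill, context) > self.score(me, skill, context)


profile = CapabilityProfile(penalty_weight=1.0)
key = ("debugging", "long_horizon")

# The coder has a long, mediocre track record in this bucket.
for outcome in [True] * 12 + [False] * 8:
    profile.record("coder", *key, outcome)

# The planner has a single success: higher mean, but very little evidence.
profile.record("planner", *key, True)
thin_evidence = profile.should_delegate("coder", "planner", *key)   # False

# After more observations the planner's advantage is well supported.
for outcome in [True] * 8 + [False]:
    profile.record("planner", *key, outcome)
solid_evidence = profile.should_delegate("coder", "planner", *key)  # True
```

With a single observed success, the planner's posterior mean (about 0.67) already exceeds the coder's (about 0.59), but the penalty keeps the task with the coder until the planner's record is better supported. How strongly the penalty should weigh, and how bucket-level posteriors share information with skill-level ones, are design choices this sketch does not address.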

Abstract

We revisit multi-agent delegation under a stronger and more realistic assumption: an agent's capability is not fixed at the skill level, but depends on task context. A coding agent may excel at short standalone edits yet fail on long-horizon debugging; a planner may perform well on shallow tasks yet degrade on chained dependencies. Static skill-level capability profiles therefore average over heterogeneous situations and can induce systematic misdelegation. We propose CADMAS-CTX, a framework for contextual capability calibration. For each agent, skill, and coarse context bucket, CADMAS-CTX maintains a Beta posterior that captures stable experience in that part of the task space. Delegation is then decided by a risk-aware score that combines the posterior mean with an uncertainty penalty, so that agents delegate only when a peer appears better and that assessment is sufficiently well supported by evidence. This paper makes three contributions. First, a hierarchical contextual capability profile replaces static skill-level confidence with context-conditioned posteriors. Second, based on contextual bandit theory, we formally prove that context-aware routing achieves lower cumulative regret than static routing under sufficient context heterogeneity, formalizing the bias-variance tradeoff. Third, we empirically validate our method on the GAIA and SWE-bench benchmarks. On GAIA with GPT-4o agents, CADMAS-CTX achieves 0.442 accuracy, outperforming the static baseline (0.381) and AutoGen (0.354) with non-overlapping 95% confidence intervals. On SWE-bench Lite, it improves the resolve rate from 22.3% to 31.4%. Ablations show that the uncertainty penalty improves robustness against context-tagging noise. Our results demonstrate that contextual calibration and risk-aware delegation significantly improve multi-agent teamwork compared with static global skill assignments.
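The claim that static profiles "average over heterogeneous situations" can be made concrete with a small worked calculation, sketched here in Python. All numbers are hypothetical and chosen only for illustration; none come from the paper. The sketch also assumes the true success rates are known, so it isolates the bias introduced by averaging and ignores the estimation variance that finer context buckets bring, which is the other side of the tradeoff the abstract mentions.

```python
# Hypothetical true success probabilities per agent and context bucket.
TRUE_SUCCESS = {
    "coder":   {"short_edit": 0.9, "long_debug": 0.3},
    "planner": {"short_edit": 0.5, "long_debug": 0.6},
}
# Hypothetical share of tasks falling in each context bucket.
CONTEXT_FREQ = {"short_edit": 0.5, "long_debug": 0.5}


def best_static_agent(success, freq):
    """Agent with the highest context-averaged success rate."""
    def average(agent):
        return sum(freq[c] * success[agent][c] for c in freq)
    return max(success, key=average)


def expected_regret(route, success, freq):
    """Expected per-task gap between the best agent and the routed agent."""
    regret = 0.0
    for context, weight in freq.items():
        best = max(success[agent][context] for agent in success)
        regret += weight * (best - success[route(context)][context])
    return regret


# Static routing: one agent for the whole skill, regardless of context.
static_agent = best_static_agent(TRUE_SUCCESS, CONTEXT_FREQ)
static_regret = expected_regret(
    lambda context: static_agent, TRUE_SUCCESS, CONTEXT_FREQ
)

# Context-aware routing: the best agent within each bucket.
contextual_regret = expected_regret(
    lambda context: max(TRUE_SUCCESS, key=lambda a: TRUE_SUCCESS[a][context]),
    TRUE_SUCCESS,
    CONTEXT_FREQ,
)
```

Under these assumed numbers the static router picks the coder everywhere (average 0.60 vs. 0.55) and loses about 0.15 expected successes per task on the long-debugging half of the workload, while the context-aware router loses nothing. If one agent were best in every bucket, both routers would agree and the gap would vanish, which is why the advantage depends on context heterogeneity.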