CADMAS-CTX: Contextual Capability Calibration for Multi-Agent Delegation

arXiv cs.AI / 4/21/2026



Key Points

The paper revisits multi-agent delegation and argues that an agent’s effective capability varies with task context rather than remaining fixed as a static skill profile.
It introduces CADMAS-CTX, which learns hierarchical, context-conditioned Beta posteriors per agent to capture experience across coarse context buckets.
Delegation decisions are made with a risk-aware scoring rule that combines the posterior mean with an uncertainty penalty, so that a task is routed to a peer only when the evidence sufficiently supports that peer being better.
The authors provide theoretical guarantees via contextual bandit analysis, proving lower cumulative regret for context-aware routing under sufficient context heterogeneity.
Experiments on GAIA and SWE-bench show consistent gains (GAIA accuracy: 0.442 vs 0.381 static baseline; SWE-bench Lite resolve rate: 22.3% → 31.4%), and ablations confirm the uncertainty penalty helps with context-tagging noise.
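The first three points can be illustrated with a minimal sketch. It is not the authors' implementation: the uniform Beta(1, 1) prior, the use of the posterior standard deviation as the uncertainty measure, the subtractive form `mean - penalty * std`, and all names (`BetaPosterior`, `risk_aware_score`, `choose_agent`, `penalty`) are assumptions made for illustration, and the hierarchical sharing across buckets is omitted.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class BetaPosterior:
    """Beta(alpha, beta) belief over a success rate in one (skill, bucket) cell."""
    alpha: float = 1.0  # assumed uniform prior, not taken from the paper
    beta: float = 1.0

    def update(self, success: bool) -> None:
        # Conjugate update: one more success or one more failure.
        if success:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)

    @property
    def std(self) -> float:
        n = self.alpha + self.beta
        return sqrt(self.alpha * self.beta / (n * n * (n + 1.0)))


def risk_aware_score(post: BetaPosterior, penalty: float = 1.0) -> float:
    # Assumed form: posterior mean minus a multiple of the posterior std.
    return post.mean - penalty * post.std


def choose_agent(profiles, self_agent, skill, bucket, penalty=1.0):
    """Return the agent that should handle the task.

    profiles maps agent name -> {(skill, bucket): BetaPosterior}.
    The task stays with self_agent unless a peer has a strictly higher
    penalized score in the same (skill, bucket) cell.
    """
    key = (skill, bucket)
    own = profiles[self_agent].get(key, BetaPosterior())
    best_agent, best_score = self_agent, risk_aware_score(own, penalty)
    for agent, table in profiles.items():
        if agent == self_agent:
            continue
        score = risk_aware_score(table.get(key, BetaPosterior()), penalty)
        if score > best_score:
            best_agent, best_score = agent, score
    return best_agent
```

Under these assumptions, a peer with a slightly higher mean but only a handful of observations loses to an agent with a longer track record once `penalty` is large enough, and setting `penalty=0.0` reduces the rule to a plain comparison of posterior means.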

Abstract

We revisit multi-agent delegation under a stronger and more realistic assumption: an agent's capability is not fixed at the skill level, but depends on task context. A coding agent may excel at short standalone edits yet fail on long-horizon debugging; a planner may perform well on shallow tasks yet degrade on chained dependencies. Static skill-level capability profiles therefore average over heterogeneous situations and can induce systematic misdelegation. We propose CADMAS-CTX, a framework for contextual capability calibration. For each agent, skill, and coarse context bucket, CADMAS-CTX maintains a Beta posterior that captures stable experience in that part of the task space. Delegation decisions are then made with a risk-aware score that combines the posterior mean with an uncertainty penalty, so that agents delegate only when a peer appears better and that assessment is sufficiently well supported by evidence. This paper makes three contributions. First, a hierarchical contextual capability profile replaces static skill-level confidence with context-conditioned posteriors. Second, based on contextual bandit theory, we formally prove that context-aware routing achieves lower cumulative regret than static routing under sufficient context heterogeneity, formalizing the bias-variance tradeoff. Third, we empirically validate our method on the GAIA and SWE-bench benchmarks. On GAIA with GPT-4o agents, CADMAS-CTX achieves 0.442 accuracy, outperforming the static baseline (0.381) and AutoGen (0.354) with non-overlapping 95% confidence intervals. On SWE-bench Lite, it improves the resolve rate from 22.3% to 31.4%. Ablations show that the uncertainty penalty improves robustness against context-tagging noise. Our results demonstrate that contextual calibration and risk-aware delegation significantly improve multi-agent teamwork compared with static global skill assignments.
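The misdelegation argument can be made concrete with a small worked example. All numbers here are hypothetical and are not taken from the paper: agent A succeeds with probability 0.9 on short tasks and 0.4 on long tasks, agent B succeeds with probability 0.6 on both, and the two contexts are equally likely.

```latex
% Hypothetical success rates, for illustration only.
% Static (context-averaged) profiles:
\bar{p}_A = 0.5 \cdot 0.9 + 0.5 \cdot 0.4 = 0.65, \qquad
\bar{p}_B = 0.5 \cdot 0.6 + 0.5 \cdot 0.6 = 0.60 .

% Static routing sends every task to A, since \bar{p}_A > \bar{p}_B:
\mathbb{E}[\text{success} \mid \text{static}] = 0.65 .

% Context-aware routing sends short tasks to A and long tasks to B:
\mathbb{E}[\text{success} \mid \text{contextual}]
  = 0.5 \cdot 0.9 + 0.5 \cdot 0.6 = 0.75 .

% Per-task gap, and the resulting regret of static routing over T tasks
% relative to the context-aware policy with known success rates:
0.75 - 0.65 = 0.10, \qquad R_{\text{static}}(T) = 0.10\,T .
```

In this toy setting the static router's loss grows linearly in the number of tasks because averaging hides A's weakness on long tasks. A context-aware router has to estimate more quantities from fewer observations per bucket, which is the variance side of the bias-variance tradeoff the abstract mentions; the example ignores that learning cost.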
