Adaptive Robust Estimator for Multi-Agent Reinforcement Learning
arXiv cs.AI, March 24, 2026
Key Points
- The paper proposes a robust multi-agent reinforcement learning framework for collaborative reasoning that addresses both interaction-level ambiguity between agents and unstable learning under heavy-tailed/noisy rewards.
- It introduces a Dual-Agent Answer-Critique-Rewrite (DACR) pipeline that structures reasoning into answer, critique, and rewrite stages while enabling explicit credit assignment via marginal contribution across agents.
- It also introduces an Adaptive Robust Estimator (ARE) that computes more reliable batch-mean reward estimates during multi-agent policy optimization, correcting the biased advantage estimates that noisy reward signals would otherwise produce.
- Experiments on mathematical reasoning and embodied intelligence benchmarks show consistent improvements over baselines in both homogeneous and heterogeneous multi-agent settings, including scenarios with significant reward noise.
- The authors report that the approach improves training stability and helps prevent optimization failures that can arise from noisy or heavy-tailed reward distributions.
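The credit-assignment idea in the DACR pipeline can be illustrated with a leave-one-out sketch: an agent's credit is the drop in team reward when that agent is removed. The `team_reward` evaluator below is hypothetical (the paper's exact scoring function is not reproduced here); this only shows the marginal-contribution pattern.

```python
def marginal_contributions(agents, team_reward):
    """Leave-one-out credit assignment: each agent's credit is the drop
    in team reward when that agent is removed from the team.

    `team_reward` is a hypothetical callable mapping a list of agents
    to a scalar team-level reward.
    """
    full = team_reward(agents)
    return {
        agent: full - team_reward([a for a in agents if a != agent])
        for agent in agents
    }
```

For a toy superadditive reward such as `team_reward = lambda team: len(team) ** 2`, each of three agents receives credit `9 - 4 = 5`, reflecting that removing any one member costs the team more than its solo value.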
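The paper's exact ARE formulation is not given in this summary, but the general idea of replacing the plain batch mean with a robust estimator when centering advantages can be sketched with a simple trimmed mean. The trimming fraction and function names here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def trimmed_mean(rewards, trim_frac=0.1):
    """Robust batch mean: drop the most extreme rewards on each tail
    before averaging, limiting the pull of heavy-tailed outliers.
    `trim_frac` is an assumed hyperparameter, not the paper's."""
    r = np.sort(np.asarray(rewards, dtype=float))
    k = int(len(r) * trim_frac)
    if k > 0:
        r = r[k:-k]
    return float(r.mean())

def centered_advantages(rewards, trim_frac=0.1):
    """Advantages centered on the robust mean instead of the plain
    batch mean, which a single outlier reward can bias heavily."""
    baseline = trimmed_mean(rewards, trim_frac)
    return np.asarray(rewards, dtype=float) - baseline
```

With a batch like `[1.0, 1.2, 0.9, 1.1, 100.0]`, the plain mean is dragged above 20 by the outlier, while the trimmed baseline stays near the bulk of the rewards, so the resulting advantages for the ordinary samples are not all pushed strongly negative.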