Multi-Agent Decision-Focused Learning via Value-Aware Sequential Communication

arXiv cs.LG / 4/13/2026


Key Points

  • The paper proposes SeqComm-DFL, a multi-agent method that optimizes communication messages for downstream decision quality rather than intermediate communication objectives like reconstruction or mutual information.
  • SeqComm-DFL generates messages in a priority order using sequential Stackelberg conditioning, letting each agent’s message generation and decisions account for what prior agents communicate (see the sketch after this list).
  • It integrates the communication mechanism into communication-augmented world models, extending Optimal Model Design with QMIX factorization to support efficient end-to-end training via implicit differentiation.
  • The authors provide information-theoretic bounds on communication value and show an $\mathcal{O}(1/\sqrt{T})$ convergence rate for the associated bilevel optimization objective.
  • On collaborative healthcare and StarCraft Multi-Agent Challenge (SMAC), SeqComm-DFL reports 4–6x gains in cumulative rewards and over 13% win-rate improvements compared with approaches that do not align messaging with decision-focused learning.
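
The sequential Stackelberg conditioning described above can be pictured as a loop over agents in priority order, where each agent’s message generator also consumes the messages already emitted by its predecessors. Below is a minimal PyTorch sketch of that conditioning pattern, not the authors’ architecture: `MessageNet`, the hidden width, and the fixed priority order are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class MessageNet(nn.Module):
    """Hypothetical per-agent message generator: conditions on the agent's
    own observation plus the messages of all higher-priority agents."""
    def __init__(self, obs_dim: int, msg_dim: int, n_predecessors: int):
        super().__init__()
        in_dim = obs_dim + n_predecessors * msg_dim
        self.net = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, msg_dim)
        )

    def forward(self, obs, predecessor_msgs):
        # Agent 0 has no predecessors, so the list may be empty.
        return self.net(torch.cat([obs] + predecessor_msgs, dim=-1))

def generate_messages(nets, observations):
    """Sequential (Stackelberg-style) generation: agent i conditions on the
    messages already produced by agents 0..i-1 in the priority order."""
    messages = []
    for i, net in enumerate(nets):
        messages.append(net(observations[i], messages[:i]))
    return messages

# Toy usage: three agents, priority order fixed here as [0, 1, 2].
obs_dim, msg_dim, n_agents = 8, 4, 3
nets = [MessageNet(obs_dim, msg_dim, i) for i in range(n_agents)]
obs = [torch.randn(1, obs_dim) for _ in range(n_agents)]
msgs = generate_messages(nets, obs)  # list of three (1, msg_dim) tensors
```

In the paper the priority order is itself derived from a guidance potential and the messages are trained for downstream decision quality rather than any local target; this sketch only shows the conditioning structure that makes later agents aware of earlier messages.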

Abstract

Multi-agent coordination under partial observability requires agents to share complementary private information. While recent methods optimize messages for intermediate objectives (e.g., reconstruction accuracy or mutual information) rather than decision quality, we introduce **SeqComm-DFL**, which unifies sequential communication with decision-focused learning for task performance. Our approach features *value-aware message generation with sequential Stackelberg conditioning*: messages maximize receiver decision quality and are generated in priority order, with agents conditioning on their predecessors and the *guidance potential* determined by their prosocial ordering. We extend Optimal Model Design to communication-augmented world models with QMIX factorization, enabling efficient end-to-end training via implicit differentiation. We prove information-theoretic bounds showing that communication value scales with coordination gaps, and establish $\mathcal{O}(1/\sqrt{T})$ convergence for the bilevel optimization, where $T$ denotes the number of training iterations. On collaborative healthcare and StarCraft Multi-Agent Challenge (SMAC) benchmarks, SeqComm-DFL achieves four to six times higher cumulative rewards and over 13% win-rate improvements, enabling coordination strategies inaccessible under information asymmetry.
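
The QMIX factorization the abstract invokes is a standard value-decomposition trick: per-agent utilities are mixed into a joint value by a network whose weights are kept non-negative, so increasing any agent’s utility cannot decrease the joint value. Below is a minimal sketch of such a mixer and of how a decision-focused loss would flow through it; the dimensions, hypernetwork shapes, and the direct backpropagation are illustrative assumptions, not the paper’s implementation (which couples the mixer to communication-augmented world models and trains via implicit differentiation).

```python
import torch
import torch.nn as nn

class QMixer(nn.Module):
    """Minimal QMIX-style monotonic mixer: per-agent Q-values are combined
    with state-conditioned non-negative weights (via torch.abs), which is
    what makes the joint value monotone in each agent's utility."""
    def __init__(self, n_agents: int, state_dim: int, embed_dim: int = 32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Linear(state_dim, 1)

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents); state: (batch, state_dim)
        b = agent_qs.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(b, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(b, 1, self.embed_dim)
        h = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(b, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(b, 1, 1)
        return (torch.bmm(h, w2) + b2).view(b)  # joint value, shape (batch,)

# Decision-focused signal (illustrative): rather than training messages on a
# reconstruction or mutual-information objective, backpropagate the mixed
# joint value into whatever produced the per-agent utilities.
mixer = QMixer(n_agents=3, state_dim=16)
agent_qs = torch.randn(5, 3, requires_grad=True)  # stand-in for message-conditioned utilities
state = torch.randn(5, 16)
loss = -mixer(agent_qs, state).mean()             # higher joint value = better decisions
loss.backward()                                   # gradients reach the message side
```

The paper’s bilevel formulation replaces this direct backpropagation with implicit differentiation through the world-model solve, which this sketch does not attempt to reproduce.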