Wireless Communication Enhanced Value Decomposition for Multi-Agent Reinforcement Learning

arXiv cs.LG / 4/13/2026


Key Points

  • The paper introduces CLOVER, a cooperative multi-agent reinforcement learning framework that conditions centralized value decomposition on the realized inter-agent communication graph under realistic wireless channels.
  • It uses a GNN-based value mixer with node-specific weights generated by a permutation-equivariant hypernetwork, enabling multi-hop message propagation that changes credit assignment according to topology.
  • The authors prove key properties of the mixer: permutation invariance, monotonicity (preserving the Individual-Global-Max, or IGM, condition), and strictly greater expressiveness than QMIX-style mixers.
  • To address stochastic wireless effects, the method introduces an augmented MDP and uses a stochastic receptive field encoder to support variable-size message sets with end-to-end differentiable training.
  • Experiments on Predator-Prey and Lumberjacks under p-CSMA channels show CLOVER improves convergence speed and final performance over several baselines, with behavioral and ablation studies attributing gains to the communication-graph inductive bias.
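The core mechanism above, a value mixer whose mixing depends on the realized communication graph while staying monotonic in each agent's utility, can be illustrated with a minimal NumPy sketch. The function name, shapes, and the single propagation layer are assumptions for illustration; the paper's mixer is a deeper GNN with weights generated by a permutation-equivariant hypernetwork.

```python
import numpy as np

def monotonic_graph_mixer(q, adj, w_hyper):
    """Toy one-layer graph-conditioned monotonic value mixer (sketch).

    q       : (n,) per-agent utilities Q_i
    adj     : (n, n) realized communication graph; adj[i, j] = 1 if
              agent j's message reached agent i (self-loops included)
    w_hyper : (n,) node-specific weights, standing in for the
              hypernetwork's outputs
    """
    w = np.abs(w_hyper)   # nonnegative weights keep Q_tot monotonic in each Q_i (IGM)
    h = adj @ q           # propagate utilities along realized communication edges
    return float(w @ h)   # topology-dependent weighted sum -> Q_tot

# Same utilities, two topologies: credit assignment changes with the graph.
q = np.array([1.0, 2.0, 3.0])
full = np.ones((3, 3))        # every message delivered
isolated = np.eye(3)          # no messages delivered
w = np.array([0.5, 0.5, 0.5])
print(monotonic_graph_mixer(q, full, w))      # 9.0
print(monotonic_graph_mixer(q, isolated, w))  # 3.0
```

Because both the adjacency entries and the weights are nonnegative, increasing any individual utility can never decrease the mixed value, which is the monotonicity property that preserves the IGM condition.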

Abstract

Cooperation in multi-agent reinforcement learning (MARL) benefits from inter-agent communication, yet most approaches assume idealized channels and existing value decomposition methods ignore who successfully shared information with whom. We propose CLOVER, a cooperative MARL framework whose centralized value mixer is conditioned on the communication graph realized under a realistic wireless channel. This graph introduces a relational inductive bias into value decomposition, constraining how individual utilities are mixed based on the realized communication structure. The mixer is a GNN with node-specific weights generated by a Permutation-Equivariant Hypernetwork: multi-hop propagation along communication edges reshapes credit assignment so that different topologies induce different mixing. We prove this mixer is permutation invariant, monotonic (preserving the IGM condition), and strictly more expressive than QMIX-style mixers. To handle realistic channels, we formulate an augmented MDP isolating stochastic channel effects from the agent computation graph, and employ a stochastic receptive field encoder for variable-size message sets, enabling end-to-end differentiable training. On Predator-Prey and Lumberjacks benchmarks under p-CSMA wireless channels, CLOVER consistently improves convergence speed and final performance over VDN, QMIX, TarMAC+VDN, and TarMAC+QMIX. Behavioral analysis confirms agents learn adaptive signaling and listening strategies, and ablations isolate the communication-graph inductive bias as the key source of improvement.
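The two channel-facing ideas in the abstract, a stochastically realized communication graph under a contention-based wireless channel and a permutation-invariant encoder over the variable-size set of received messages, can be sketched as follows. The collision model here (a message gets through only when exactly one agent transmits in a slot, all agents in radio range) is a deliberately simplified stand-in for p-CSMA, and the mean-pooling encoder is only a placeholder for the paper's stochastic receptive field encoder; all names and shapes are illustrative assumptions.

```python
import numpy as np

def sample_pcsma_graph(n, p, rng):
    """Sample a realized communication graph under a toy p-CSMA slot:
    each agent transmits with probability p; receivers decode only if
    exactly one agent transmitted (two or more collide). Assumes all
    agents are in radio range -- an illustrative model, not the
    paper's exact channel."""
    tx = rng.random(n) < p
    adj = np.eye(n, dtype=int)          # every agent keeps its own state
    if tx.sum() == 1:                   # lone transmitter: all receive
        j = int(np.flatnonzero(tx)[0])
        adj[:, j] = 1
    return adj                          # otherwise collision: no edges added

def aggregate_messages(msgs, adj, i):
    """Permutation-invariant mean over the variable-size set of messages
    agent i actually received (placeholder for the stochastic receptive
    field encoder)."""
    received = msgs[adj[i] == 1]
    return received.mean(axis=0)

rng = np.random.default_rng(0)
adj = sample_pcsma_graph(4, p=0.3, rng=rng)  # realized graph for this slot
msgs = rng.standard_normal((4, 8))           # one 8-dim message per agent
enc = aggregate_messages(msgs, adj, i=0)     # fixed-size encoding, shape (8,)
```

Because the aggregation is a symmetric function over whichever messages arrived, the encoder handles any realized receptive field size, which is what makes end-to-end differentiable training possible despite the stochastic channel.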