Multi-Modal Manipulation via Multi-Modal Policy Consensus

arXiv cs.RO / 4/17/2026


Key Points

  • The paper addresses limitations of common multimodal robotic manipulation methods, arguing that simple feature concatenation can let dominant sensors (e.g., vision) overwhelm crucial but sparse signals (e.g., touch).
  • It proposes a multimodal policy that factorizes control into multiple diffusion models, each specialized for a single modality, and uses a router network to learn consensus weights for combining them.
  • The approach is designed to adapt incrementally when new representations are added or when modalities are missing, avoiding full retraining of a monolithic model.
  • Experiments on simulated RLBench tasks and real-world manipulation scenarios (e.g., occluded object picking, in-hand spoon reorientation, puzzle insertion) show significant gains over feature-concatenation baselines, especially on scenarios that require multimodal reasoning.
  • The policy also demonstrates robustness to physical perturbations and sensor corruption, and importance analysis indicates that the system adaptively shifts attention across modalities under different conditions.
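
The consensus mechanism described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: it assumes each modality-specific diffusion expert emits a score (denoising) prediction for the action, and that the router's consensus weights are a softmax over learned logits. All names (`expert_scores`, `router_logits`, `consensus_action`) are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def consensus_action(expert_scores, router_logits):
    """Combine per-modality expert outputs with router consensus weights.

    expert_scores: (num_modalities, action_dim) per-expert score predictions
    router_logits: (num_modalities,) unnormalized router outputs
    """
    w = softmax(router_logits)   # consensus weights, sum to 1
    return w @ expert_scores     # (action_dim,) weighted combination

# Toy example: two experts, one per modality. The router can shift the
# combined prediction toward touch when vision is uninformative.
vision = np.array([1.0, 0.0])
touch  = np.array([0.0, 1.0])
scores = np.stack([vision, touch])

balanced  = consensus_action(scores, np.array([0.0, 0.0]))   # equal weights
touch_led = consensus_action(scores, np.array([-2.0, 2.0]))  # touch dominates
```

Because each expert is trained on a single modality and only the mixing weights are learned jointly, adding a new representation means training one new expert and updating the router, rather than retraining a monolithic policy.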

Abstract

Effectively integrating diverse sensory modalities is crucial for robotic manipulation. However, the typical approach of feature concatenation is often suboptimal: dominant modalities such as vision can overwhelm sparse but critical signals like touch in contact-rich tasks, and monolithic architectures cannot flexibly incorporate new or missing modalities without retraining. Our method factorizes the policy into a set of diffusion models, each specialized for a single representation (e.g., vision or touch), and employs a router network that learns consensus weights to adaptively combine their contributions, enabling incremental incorporation of new representations. We evaluate our approach on simulated manipulation tasks in RLBench, as well as real-world tasks such as occluded object picking, in-hand spoon reorientation, and puzzle insertion, where it significantly outperforms feature-concatenation baselines on scenarios requiring multimodal reasoning. Our policy further demonstrates robustness to physical perturbations and sensor corruption. A perturbation-based importance analysis additionally reveals adaptive shifts between modalities.
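
The perturbation-based importance analysis mentioned at the end of the abstract can be approximated with a simple probe: corrupt one modality's input with noise and measure how far the policy output moves. The sketch below is a generic illustration of that idea, not the paper's implementation; `importance_by_perturbation` and the toy policy are hypothetical names.

```python
import numpy as np

def importance_by_perturbation(policy, inputs, noise_scale=1.0, seed=0):
    """Estimate per-modality importance as the output shift caused by
    corrupting that modality's input while leaving the others intact.

    policy: callable mapping {modality_name: array} -> action array
    inputs: dict of per-modality observation arrays
    """
    rng = np.random.default_rng(seed)
    base = policy(inputs)
    scores = {}
    for name, x in inputs.items():
        corrupted = dict(inputs)  # shallow copy; only one entry is replaced
        corrupted[name] = x + noise_scale * rng.standard_normal(x.shape)
        scores[name] = float(np.linalg.norm(policy(corrupted) - base))
    return scores

# Toy policy that, by construction, ignores touch entirely: its touch
# importance should come out as exactly zero.
toy_policy = lambda obs: 2.0 * obs["vision"]
obs = {"vision": np.zeros(3), "touch": np.zeros(3)}
scores = importance_by_perturbation(toy_policy, obs)
```

Applied to the factorized policy, a probe like this would surface the adaptive weight shifts the paper reports, e.g., touch importance rising when the target object is visually occluded.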