AI Navigate

Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation

arXiv cs.CL · March 20, 2026


Key Points

  • MAPG (Multi-Agent Probabilistic Grounding) is proposed to enable metrically consistent, actionable decisions in 3D space by decomposing natural language goals into structured subcomponents and grounding each with a vision-language model.
  • The framework grounds each language component separately and probabilistically composes the results to satisfy metric constraints such as distance and relative position.
  • MAPG is evaluated on the HM-EQA benchmark, showing consistent improvements over strong baselines, and the authors introduce MAPG-Bench to specifically evaluate metric-semantic goal grounding.
  • A real-world robot demonstration indicates that MAPG can transfer from simulation to practice when a structured scene representation is available.
  • The work addresses the limitations of current VLM grounding in metric reasoning and proposes an agentic, modular approach that bridges language understanding and metrically grounded navigation.
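To make the decomposition step concrete, one plausible structured form for a metric-semantic query is sketched below. This is illustrative only: the field names and schema are assumptions, not the paper's actual representation.

```python
from dataclasses import dataclass

@dataclass
class GroundingQuery:
    """One hypothetical decomposition of a metric-semantic language goal."""
    referent: str        # semantic reference, grounded by the VLM (e.g. "fridge")
    relation: str        # spatial relation taken relative to the referent
    distance_m: float    # metric constraint, handled geometrically

# "go two meters to the right of the fridge" could decompose as:
q = GroundingQuery(referent="fridge", relation="right_of", distance_m=2.0)
print(q)
```

Each subcomponent can then be grounded separately (the referent by the VLM, the relation and distance geometrically) before the results are composed.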

Abstract

Robots collaborating with humans must convert natural language goals into actionable, physically grounded decisions. For example, executing a command such as "go two meters to the right of the fridge" requires grounding semantic references, spatial relations, and metric constraints within a 3D scene. While recent vision-language models (VLMs) demonstrate strong semantic grounding capabilities, they are not explicitly designed to reason about metric constraints in physically defined spaces. In this work, we empirically demonstrate that state-of-the-art VLM-based grounding approaches struggle with complex metric-semantic language queries. To address this limitation, we propose MAPG (Multi-Agent Probabilistic Grounding), an agentic framework that decomposes language queries into structured subcomponents and queries a VLM to ground each component. MAPG then probabilistically composes these grounded outputs to produce metrically consistent, actionable decisions in 3D space. We evaluate MAPG on the HM-EQA benchmark and show consistent performance improvements over strong baselines. Furthermore, we introduce a new benchmark, MAPG-Bench, specifically designed to evaluate metric-semantic goal grounding, addressing a gap in existing language grounding evaluations. We also present a real-world robot demonstration showing that MAPG transfers beyond simulation when a structured scene representation is available.
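The probabilistic composition idea from the abstract can be sketched on a toy 2D grid: each query component yields a likelihood map over cells, and composing them is an elementwise product whose argmax gives the goal location. This is a minimal illustration of the general idea, not the authors' implementation; the grid size, the Gaussian/sigmoid likelihood forms, and the assumed fridge detection at (3.0, 5.0) are all made-up assumptions.

```python
import numpy as np

GRID, CELL = 50, 0.2                       # 50x50 cells, 0.2 m per cell
ys, xs = np.mgrid[0:GRID, 0:GRID]
X, Y = xs * CELL, ys * CELL                # metric coordinates of cell centres

def semantic_map(obj_xy, sigma=0.5):
    """Stand-in for VLM grounding of the referent: a Gaussian bump."""
    return np.exp(-((X - obj_xy[0])**2 + (Y - obj_xy[1])**2) / (2 * sigma**2))

def metric_map(anchor_xy, dist, tol=0.3):
    """Metric constraint: a soft ring ~dist metres from the anchor."""
    d = np.hypot(X - anchor_xy[0], Y - anchor_xy[1])
    return np.exp(-(d - dist)**2 / (2 * tol**2))

def relation_map(anchor_xy, sharpness=0.3):
    """Spatial relation 'right of': soft half-plane of increasing x."""
    return 1.0 / (1.0 + np.exp(-(X - anchor_xy[0]) / sharpness))

# 1) Ground the referent ("the fridge"); its detected location is assumed here.
sem = semantic_map((3.0, 5.0))
iy, ix = np.unravel_index(sem.argmax(), sem.shape)
anchor = (X[iy, ix], Y[iy, ix])

# 2) Probabilistically compose the relation and metric constraint, then pick
#    the most likely cell as the navigation goal.
goal = metric_map(anchor, dist=2.0) * relation_map(anchor)
iy, ix = np.unravel_index(goal.argmax(), goal.shape)
gx, gy = X[iy, ix], Y[iy, ix]              # lands ~2 m to the right: near (5.0, 5.0)
print(f"goal cell: ({gx:.1f}, {gy:.1f}) m")
```

The elementwise product is one simple way to realize "probabilistic composition" under independence assumptions; the actual framework may weight or combine component distributions differently.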