GaLa: Hypergraph-Guided Visual Language Models for Procedural Planning

arXiv cs.RO / 4/21/2026

📰 News · Models & Research

Key Points

  • The paper introduces GaLa, a vision-language framework for multimodal procedural planning in embodied AI that targets challenges in understanding functional spatial relationships in complex scenes.
  • GaLa uses a hypergraph representation of visual inputs, treating object instances as nodes and creating region-level hyperedges by aggregating objects based on attributes and functional semantics.
  • It proposes a TriView HyperGraph Encoder that applies contrastive learning to keep semantic representations consistent across multiple views (node view, area view, and node-area association view).
  • Experiments on ActPlan1K and ALFRED show that GaLa achieves notably better execution success rate, LCS, and planning correctness than existing methods.
  • The overall approach shifts part of the burden away from pure VLM reasoning by explicitly injecting structured semantic and spatial information, mined from multimodal data, into downstream planning.
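
The hypergraph representation in the second bullet can be sketched as a small data structure: object instances become nodes, and each functional region becomes a hyperedge grouping several of them. Everything below (class name, fields, the example scene) is an illustrative assumption, not the paper's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class SceneHypergraph:
    # Hypothetical sketch: object instances are nodes; a hyperedge groups
    # the objects of one functional region (names here are assumptions).
    nodes: list = field(default_factory=list)        # object instance labels
    hyperedges: dict = field(default_factory=dict)   # region name -> node ids

    def add_object(self, label: str) -> int:
        self.nodes.append(label)
        return len(self.nodes) - 1

    def add_region(self, name: str, node_ids: list) -> None:
        self.hyperedges[name] = list(node_ids)

    def incidence_matrix(self) -> list:
        """|V| x |E| 0/1 matrix H with H[v][e] = 1 iff node v is in hyperedge e."""
        edges = list(self.hyperedges.values())
        return [[1 if v in e else 0 for e in edges] for v in range(len(self.nodes))]

g = SceneHypergraph()
stove = g.add_object("stove")
pan = g.add_object("pan")
sink = g.add_object("sink")
sponge = g.add_object("sponge")
g.add_region("cooking_area", [stove, pan])
g.add_region("cleaning_area", [sink, sponge, pan])  # pan sits in both regions

H = g.incidence_matrix()
```

Unlike an ordinary graph, a hyperedge can connect any number of nodes at once, which is what lets a single "region" edge capture the hierarchical grouping of several objects; shared members (the pan above) encode that one object participates in multiple functional regions.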

Abstract

Implicit spatial relations and deep semantic structures encoded in object attributes are crucial for procedural planning in embodied AI systems. However, existing approaches often over-rely on the reasoning capabilities of vision-language models (VLMs) themselves, while overlooking the rich structured semantic information that can be mined from multimodal inputs. As a result, models struggle to effectively understand functional spatial relationships in complex scenes. To fully exploit implicit spatial relations and deep semantic structures in multimodal data, we propose GaLa, a vision-language framework for multimodal procedural planning. GaLa introduces a hypergraph-based representation, where object instances in the image are modeled as nodes, and region-level hyperedges are constructed by aggregating objects according to their attributes and functional semantics. This design explicitly captures implicit semantic relations among objects as well as the hierarchical organization of functional regions. Furthermore, we design a TriView HyperGraph Encoder that enforces semantic consistency across the node view, area view, and node-area association view via contrastive learning, enabling hypergraph semantics to be more effectively injected into downstream VLM reasoning. Extensive experiments on the ActPlan1K and ALFRED benchmarks demonstrate that GaLa significantly outperforms existing methods in terms of execution success rate, LCS, and planning correctness.
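
The cross-view consistency objective described in the abstract can be illustrated with a minimal InfoNCE-style contrastive loss: embeddings of the same scene element from two views (e.g. the node view and the area view; the paper uses three) form positive pairs and are pulled together, while other elements in the batch are pushed apart. The toy vectors and the temperature value below are illustrative assumptions, not the paper's actual objective or hyperparameters.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(view_a, view_b, temperature=0.1):
    """Average InfoNCE loss treating (view_a[i], view_b[i]) as positive pairs."""
    losses = []
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, b) / temperature for b in view_b]
        log_denom = math.log(sum(math.exp(l) for l in logits))
        losses.append(log_denom - logits[i])  # -log softmax of the positive
    return sum(losses) / len(losses)

# Two toy "views" of three elements: matching rows are roughly aligned,
# so the loss is small compared to a deliberately shuffled pairing.
a = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
b = [[0.9, 0.1], [0.1, 0.9], [0.6, 0.8]]
aligned = info_nce(a, b)
shuffled = info_nce(a, [b[1], b[2], b[0]])
```

Minimizing such a loss pairwise across the node, area, and node-area association views is one plausible reading of how the TriView HyperGraph Encoder keeps the three representations semantically consistent before their output is injected into the VLM.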