HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

arXiv cs.CV / April 16, 2026


Key Points

  • The paper introduces HiVLA, a hierarchical visual-grounded-centric embodied manipulation system that separates high-level semantic planning from low-level motor control to avoid degrading a base vision-language model’s reasoning during fine-tuning.
  • In the high-level stage, a VLM planner performs task decomposition and visual grounding, outputting structured plans that include subtask instructions and target bounding boxes.
  • For low-level execution, HiVLA uses a flow-matching Diffusion Transformer (DiT) action expert with a cascaded cross-attention mechanism to integrate global context, object-centric crops, and skill semantics for robust action generation.
  • Experiments in both simulation and the real world report that HiVLA significantly outperforms end-to-end VLA baselines, with particular strength in long-horizon skill composition and small-object manipulation in cluttered environments.
  • The proposed decoupled architecture is designed to preserve the base VLM’s zero-shot reasoning while allowing independent improvements to the planning and action components over time.
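The structured plan emitted by the high-level planner can be pictured as a small schema pairing each subtask instruction with its target bounding box. The field names and parser below are illustrative assumptions, not the paper's actual interface:

```python
from dataclasses import dataclass

@dataclass
class SubtaskPlan:
    """One step from the high-level VLM planner (hypothetical schema)."""
    instruction: str  # e.g. "pick up the red mug"
    bbox: tuple       # target box as (x1, y1, x2, y2) in image coordinates

def parse_plan(raw: dict) -> list:
    # Hypothetical parser: converts the planner's structured output into
    # an ordered list of subtasks for the low-level action expert.
    return [SubtaskPlan(s["instruction"], tuple(s["bbox"]))
            for s in raw["subtasks"]]
```

A plan like `{"subtasks": [{"instruction": "grasp the mug", "bbox": [10, 20, 60, 80]}]}` would then be consumed one subtask at a time by the action expert, which receives both the instruction and a crop around the box.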

Abstract

While end-to-end Vision-Language-Action (VLA) models offer a promising paradigm for robotic manipulation, fine-tuning them on narrow control data often compromises the profound reasoning capabilities inherited from their base Vision-Language Models (VLMs). To resolve this fundamental trade-off, we propose HiVLA, a visual-grounded-centric hierarchical framework that explicitly decouples high-level semantic planning from low-level motor control. In the high-level part, a VLM planner first performs task decomposition and visual grounding to generate structured plans, each comprising a subtask instruction and a precise target bounding box. Then, to translate this plan into physical actions, we introduce a flow-matching Diffusion Transformer (DiT) action expert in the low-level part, equipped with a novel cascaded cross-attention mechanism. This design sequentially fuses global context, high-resolution object-centric crops, and skill semantics, enabling the DiT to focus purely on robust execution. Our decoupled architecture preserves the VLM's zero-shot reasoning while allowing independent improvement of both components. Extensive experiments in simulation and the real world demonstrate that HiVLA significantly outperforms state-of-the-art end-to-end baselines, particularly excelling in long-horizon skill composition and the fine-grained manipulation of small objects in cluttered scenes.
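The low-level design described above combines two standard building blocks: cascaded cross-attention (action tokens attend sequentially to global scene features, object-centric crop features, and skill semantics) and flow-matching sampling (Euler integration of a learned velocity field from noise to an action chunk). The toy sketch below shows both mechanisms in simplified single-head form with NumPy; it is a minimal illustration of the general techniques, not the paper's implementation, and all shapes and names are assumptions:

```python
import numpy as np

def cross_attention(q, kv):
    # Single-head scaled dot-product cross-attention with a residual
    # connection; projection weights are omitted for brevity.
    scores = q @ kv.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return q + w @ kv

def cascaded_fusion(action_tokens, global_ctx, crop_ctx, skill_ctx):
    # Sequentially fuse context sources, coarse to fine: full-scene
    # features, then high-resolution object-centric crop features,
    # then the subtask's skill semantics.
    x = cross_attention(action_tokens, global_ctx)
    x = cross_attention(x, crop_ctx)
    x = cross_attention(x, skill_ctx)
    return x

def flow_matching_sample(velocity_fn, noise, steps=10):
    # Euler integration of the learned velocity field from t=0 (noise)
    # toward t=1 (the denoised action chunk).
    x, dt = noise, 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x
```

In a full DiT action expert, `velocity_fn` would be the transformer itself, with `cascaded_fusion` applied inside each block so every denoising step is conditioned on scene, crop, and skill context.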