LILAC: Language-Conditioned Object-Centric Optical Flow for Open-Loop Trajectory Generation

arXiv cs.RO / 3/27/2026


Key Points

  • LILAC is a language-conditioned Vision-Language-Action (VLA) method that generates object-centric 2D optical flow from a natural language instruction and an RGB image, then converts that flow into a trajectory for a 6-DoF manipulator.
  • Training uses videos of humans and videos from the web, with the goal of minimizing embodiment-specific data; the central challenge the work targets is instruction-flow alignment during trajectory generation.
  • The proposed method incorporates two components: a Semantic Alignment Loss, which strengthens language conditioning so that the generated optical flow aligns with the instruction, and a Prompt-Conditioned Cross-Modal Adapter, which aligns learned visual prompts with image and text features (a sketch of the loss follows this list).
  • Across multiple benchmarks, the generated optical flow surpasses existing methods in quality, and physical-robot experiments with free-form instructions show a high task success rate.
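
The summary does not spell out how the Semantic Alignment Loss is formulated, so the following is only a minimal sketch of one plausible instantiation: a CLIP-style contrastive objective that pulls an embedding of the generated flow toward the paired instruction's text embedding. The function name `semantic_alignment_loss`, the assumed flow/text encoders, and the temperature `tau` are all hypothetical, not the paper's definition.

```python
import torch
import torch.nn.functional as F

def semantic_alignment_loss(flow_emb: torch.Tensor,
                            text_emb: torch.Tensor,
                            tau: float = 0.07) -> torch.Tensor:
    """Contrastive (InfoNCE-style) alignment between flow and text embeddings.

    flow_emb: (B, D) embedding of the generated optical flow
              (from a hypothetical flow encoder).
    text_emb: (B, D) embedding of the paired language instruction.
    """
    # L2-normalize so dot products become cosine similarities.
    flow_emb = F.normalize(flow_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (B, B) similarity matrix; diagonal entries are the matched pairs.
    logits = flow_emb @ text_emb.t() / tau
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy, as in CLIP-style training.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

A loss of this shape would penalize flows whose embeddings match a different instruction in the batch better than their own, which is one way to enforce the instruction-flow alignment the paper emphasizes.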

Abstract

We address language-conditioned robotic manipulation using flow-based trajectory generation, which enables training on human and web videos of object manipulation and requires only minimal embodiment-specific data. This task is challenging, as object trajectory generation from pre-manipulation images and natural language instructions requires appropriate instruction-flow alignment. To tackle this challenge, we propose the flow-based Language Instruction-guided open-Loop ACtion generator (LILAC). This flow-based Vision-Language-Action (VLA) model generates object-centric 2D optical flow from an RGB image and a natural language instruction, and converts the flow into a 6-DoF manipulator trajectory. LILAC incorporates two key components: a Semantic Alignment Loss, which strengthens language conditioning to generate instruction-aligned optical flow, and a Prompt-Conditioned Cross-Modal Adapter, which aligns learned visual prompts with image and text features to provide rich cues for flow generation. Experimentally, our method outperformed existing approaches in generated flow quality across multiple benchmarks. Furthermore, in physical object manipulation experiments using free-form instructions, LILAC demonstrated a superior task success rate compared to existing methods. The project page is available at https://lilac-75srg.kinsta.page/.
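
The abstract states that LILAC converts the generated 2D flow into a 6-DoF manipulator trajectory but does not detail the conversion step here. One common recipe, shown below purely as a hedged sketch and not as the paper's method, lifts the tracked object points into 3D using a depth map and camera intrinsics, then solves a per-step rigid transform with the Kabsch/Procrustes algorithm. All function names and the availability of depth are assumptions for illustration.

```python
import numpy as np

def lift_to_3d(points_2d: np.ndarray, depth: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Back-project (N, 2) pixel coordinates into camera-frame 3D points,
    assuming a depth map and intrinsics K are available."""
    u, v = points_2d[:, 0], points_2d[:, 1]
    z = depth[v.astype(int), u.astype(int)]
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=1)

def kabsch(P: np.ndarray, Q: np.ndarray):
    """Least-squares rigid transform (R, t) mapping point set P onto Q."""
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cP).T @ (Q - cQ)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, cQ - R @ cP

def flow_to_trajectory(track_2d: np.ndarray, depth0: np.ndarray, K: np.ndarray):
    """track_2d: (T, N, 2) per-step 2D positions of object points taken from
    the generated flow. Returns per-step (R, t) object poses relative to step 0."""
    P0 = lift_to_3d(track_2d[0], depth0, K)
    poses = []
    for t in range(1, track_2d.shape[0]):
        # Hypothetical simplification: reuse the initial depth at each step.
        Pt = lift_to_3d(track_2d[t], depth0, K)
        poses.append(kabsch(P0, Pt))
    return poses
```

The resulting object poses could then be mapped to end-effector waypoints through the grasp transform; how LILAC actually performs this conversion is specified in the paper itself, not in this summary.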