Text-Guided 6D Object Pose Rearrangement via Closed-Loop VLM Agents

arXiv cs.CV / 4/14/2026



Key Points

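VLMs struggle to infer a goal 6D object pose (position and orientation) that is consistent with a text instruction. This paper shows that an iterative, closed-loop procedure at inference time yields large performance gains on that task.
Given a 3D scene from an RGB-D image (or a scene composed of meshes), the method makes the VLM act as an agent by repeating four steps: (1) observe the current scene, (2) evaluate its faithfulness to the instruction, (3) propose a pose update for the target object, and (4) apply the update and re-render.
It introduces three inference-time techniques described as essential to the closed loop: multi-view reasoning with supporting view selection, visualization of an object-centered coordinate system, and single-axis rotation prediction.
Without additional fine-tuning or new modules, the approach outperforms prior methods with both closed-source and open-source VLMs. Combined with simple robot motion planning, it also achieves a higher robot manipulation success rate than existing methods.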
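To make the idea of a single-axis rotation update in an object-centered frame more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the function names (`axis_rotation`, `apply_update`), the rotation-matrix pose representation, and the choice to apply the translation shift along the object's axes before the rotation are all hypothetical.

```python
import math

def axis_rotation(axis, degrees):
    """Return a 3x3 rotation matrix about a single coordinate axis."""
    c = math.cos(math.radians(degrees))
    s = math.sin(math.radians(degrees))
    if axis == "x":
        return [[1, 0, 0], [0, c, -s], [0, s, c]]
    if axis == "y":
        return [[c, 0, s], [0, 1, 0], [-s, 0, c]]
    if axis == "z":
        return [[c, -s, 0], [s, c, 0], [0, 0, 1]]
    raise ValueError(f"axis must be 'x', 'y' or 'z', got {axis!r}")

def matmul3(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def apply_update(rotation, translation, axis, degrees, shift):
    """Apply one pose update expressed in the object's own frame.

    Assumption (not from the paper): the shift is interpreted along the
    object's current axes, and the rotation is post-multiplied so that it
    turns the object about its own local axis.
    """
    new_translation = tuple(
        translation[i] + sum(rotation[i][k] * shift[k] for k in range(3))
        for i in range(3)
    )
    new_rotation = matmul3(rotation, axis_rotation(axis, degrees))
    return new_rotation, new_translation

# Usage: start at the origin, move 1 unit along the object's x axis,
# then turn 90 degrees about the object's z axis.
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
rotation, translation = apply_update(identity, (0.0, 0.0, 0.0), "z", 90, (1.0, 0.0, 0.0))
```

Restricting each update to one axis keeps the proposal to a single label and a single angle, which is presumably easier for a VLM to state than a full 3D rotation; a sequence of such updates can still reach any orientation.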

Abstract

Vision-Language Models (VLMs) exhibit strong visual reasoning capabilities, yet they still struggle with 3D understanding. In particular, VLMs often fail to infer a text-consistent goal 6D pose of a target object in a 3D scene. However, we find that with some inference-time techniques and iterative reasoning, VLMs can achieve dramatic performance gains. Concretely, given a 3D scene represented by an RGB-D image (or a compositional scene of 3D meshes) and a text instruction specifying a desired state change, we repeat the following loop: observe the current scene; evaluate whether it is faithful to the instruction; propose a pose update for the target object; apply the update; and render the updated scene. Through this closed-loop interaction, the VLM effectively acts as an agent. We further introduce three inference-time techniques that are essential to this closed-loop process: (i) multi-view reasoning with supporting view selection, (ii) object-centered coordinate system visualization, and (iii) single-axis rotation prediction. Without any additional fine-tuning or new modules, our approach surpasses prior methods at predicting the text-guided goal 6D pose of the target object. It works consistently across both closed-source and open-source VLMs. Moreover, when combining our 6D pose prediction with simple robot motion planning, it enables more successful robot manipulation than existing methods. Finally, we conduct an ablation study to demonstrate the necessity of each proposed technique.
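The observe, evaluate, propose, apply, and render loop described in the abstract can be sketched as a small control loop. The following Python is a hedged illustration only: `closed_loop_rearrange` and its callback parameters are hypothetical names, the VLM calls are replaced by toy stand-in functions, and the pose is a single number rather than a real 6D pose.

```python
def closed_loop_rearrange(pose, instruction, render, is_faithful,
                          propose_update, apply_update, max_iters=10):
    """Iterate until the evaluator accepts the scene or the budget runs out.

    All callbacks are placeholders: in a real system `is_faithful` and
    `propose_update` would query a VLM, and `render` would produce images.
    Returns the final pose and the number of updates applied.
    """
    for step in range(max_iters):
        observation = render(pose)                         # observe / render
        if is_faithful(observation, instruction):          # evaluate
            return pose, step
        update = propose_update(observation, instruction)  # propose
        pose = apply_update(pose, update)                  # apply
    return pose, max_iters

# Toy stand-ins (assumptions for illustration): the "scene" is a 1D position
# and the "instruction" is a target coordinate.
def toy_render(pose):
    return pose

def toy_is_faithful(observation, target):
    return abs(observation - target) < 0.5

def toy_propose(observation, target):
    # Bounded step toward the target, mimicking small incremental updates.
    return max(-2.0, min(2.0, target - observation))

def toy_apply(pose, update):
    return pose + update

final_pose, steps = closed_loop_rearrange(
    0.0, 5.0, toy_render, toy_is_faithful, toy_propose, toy_apply)
print(final_pose, steps)  # prints "5.0 3"
```

The point of the structure is that each proposal is checked against a fresh rendering of the updated scene, so an imperfect single-shot estimate can be corrected over later iterations.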



