AsgardBench - Evaluating Visually Grounded Interactive Planning Under Minimal Feedback

arXiv cs.AI / 3/18/2026

Key Points

  • AsgardBench is a benchmark for evaluating visually grounded, high-level action sequence generation and interactive planning, with plan adaptation driven by visual observations rather than by navigation or low-level manipulation.
  • The benchmark isolates interactive planning by restricting inputs to images, action history, and lightweight success/failure signals within a controlled simulator to avoid perception substitutions.
  • It comprises 108 task instances across 12 task types with systematic variations to create conditional branches that require plan repair during execution.
  • Evaluations indicate that leading vision-language models struggle without visual input, revealing weaknesses in visual grounding and state tracking that hinder interactive planning.
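
As a rough illustration of the restricted interface described in the key points above (not the benchmark's actual API), a per-step observation might carry nothing more than an image, the action history, and a lightweight success/failure flag. All names and fields below are assumptions for the sketch:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Observation:
    """Hypothetical per-step input for an AsgardBench-style agent.

    The field names and types are illustrative assumptions, not the
    benchmark's real interface.
    """

    image: bytes                                              # current rendered frame from the simulator
    action_history: list[str] = field(default_factory=list)   # high-level actions issued so far
    last_action_succeeded: Optional[bool] = None               # success/failure signal (None before the first action)
```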

Abstract

With AsgardBench we aim to evaluate visually grounded, high-level action sequence generation and interactive planning, focusing specifically on plan adaptation during execution based on visual observations rather than on navigation or low-level manipulation. In the landscape of embodied AI benchmarks, AsgardBench targets the capability category of interactive planning, which is more demanding than offline high-level planning because it requires agents to revise plans in response to environmental feedback, yet remains distinct from low-level execution. Unlike prior embodied AI benchmarks that conflate reasoning with navigation or provide rich corrective feedback that substitutes for perception, AsgardBench restricts agent input to images, action history, and lightweight success/failure signals, isolating interactive planning in a controlled simulator without low-level control noise. The benchmark contains 108 task instances spanning 12 task types, each systematically varied through object state, placement, and scene configuration. These controlled variations create conditional branches in which a single instruction can require different action sequences depending on what the agent observes, making plan repair during execution the central challenge. Our evaluations of leading vision-language models show that performance drops sharply without visual input, revealing weaknesses in visual grounding and state tracking that ultimately undermine interactive planning. In short, the benchmark zeroes in on one narrow question: can a model actually use what it sees to adapt a plan when things do not go as expected?
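
To make the evaluation protocol concrete, here is a minimal sketch of an interactive-planning loop under the constraints the abstract describes. The `agent` and `env` objects are hypothetical stand-ins, and the method names (`next_action`, `step`, `task_completed`) are assumptions rather than the benchmark's real interface:

```python
from typing import Optional


def run_episode(agent, env, instruction: str, max_steps: int = 30) -> bool:
    """Sketch of one AsgardBench-style episode under minimal feedback.

    The agent only ever sees the current image, its own action history,
    and a success/failure flag for the previous action; when an action
    fails, it is expected to repair its plan rather than replay a fixed
    offline sequence.
    """
    history: list[str] = []
    last_ok: Optional[bool] = None
    image = env.reset(instruction)              # initial visual observation

    for _ in range(max_steps):
        # Propose the next high-level action from the restricted inputs only.
        action = agent.next_action(
            instruction=instruction,
            image=image,
            action_history=history,
            last_action_succeeded=last_ok,
        )
        if action == "DONE":                    # agent believes the task is complete
            break
        image, last_ok = env.step(action)       # new frame + lightweight success signal
        history.append(action)

    return env.task_completed()                 # episode-level success judged by the simulator
```

Because the same instruction can demand different action sequences depending on the observed object states, a loop like this only succeeds if the model actually conditions its next action on the image and the failure signals rather than emitting a single offline plan.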