KITE: Keyframe-Indexed Tokenized Evidence for VLM-Based Robot Failure Analysis

arXiv cs.RO / 4/9/2026


Key Points

  • KITE is a training-free, keyframe-anchored visual front-end that turns long robot-execution videos into compact, interpretable tokenized evidence for vision-language models (VLMs).
  • It summarizes trajectories into motion-salient keyframes paired with bird’s-eye-view (BEV) schematics capturing relative object layout, axes, timestamps, and detection confidence, then serializes these cues into a unified prompt with robot- and scene-context tokens.
  • The same KITE prompt structure supports multiple robot-failure-analysis tasks, including failure detection, identification, localization, explanation, and correction using an off-the-shelf VLM.
  • On the RoboFAC benchmark, KITE with Qwen2.5-VL significantly outperforms vanilla Qwen2.5-VL in the training-free setting, with the biggest improvements in simulation failure detection, identification, and localization.
  • A small QLoRA fine-tune further boosts explanation and correction quality, and qualitative tests on real dual-arm robots suggest practical applicability. Code and models are released.

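The keyframe step above can be sketched in a few lines. The paper does not specify its motion-saliency criterion here, so this illustration uses a simple frame-difference energy score; the function name, scoring rule, and parameters are assumptions for illustration only.

```python
import numpy as np

def select_keyframes(frames, k=6):
    """Pick k motion-salient keyframes from a video (T, H, W, C array).

    Hypothetical sketch: scores each frame by the mean absolute pixel
    difference from its predecessor, then keeps the k highest-motion
    frames in temporal order. KITE's actual saliency criterion may
    differ.
    """
    frames = np.asarray(frames, dtype=np.float32)
    # Motion energy: mean |frame_t - frame_{t-1}|; frame 0 scores 0.
    diffs = np.abs(frames[1:] - frames[:-1]).mean(axis=(1, 2, 3))
    scores = np.concatenate([[0.0], diffs])
    # Indices of the k highest-motion frames, sorted back into time order.
    return np.sort(np.argsort(scores)[-k:]).tolist()

# Tiny synthetic example: 10 "frames" with abrupt changes at t=3 and t=7.
video = np.zeros((10, 4, 4, 3), dtype=np.float32)
video[3:] += 1.0   # large change entering frame 3
video[7:] += 1.0   # large change entering frame 7
print(select_keyframes(video, k=2))  # → [3, 7]
```

In practice each selected keyframe would then be passed to an open-vocabulary detector and paired with its BEV schematic, as the key points describe.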
Abstract

We present KITE, a training-free, keyframe-anchored, layout-grounded front-end that converts long robot-execution videos into compact, interpretable tokenized evidence for vision-language models (VLMs). KITE distills each trajectory into a small set of motion-salient keyframes with open-vocabulary detections and pairs each keyframe with a schematic bird's-eye-view (BEV) representation that encodes relative object layout, axes, timestamps, and detection confidence. These visual cues are serialized with robot-profile and scene-context tokens into a unified prompt, allowing the same front-end to support failure detection, identification, localization, explanation, and correction with an off-the-shelf VLM. On the RoboFAC benchmark, KITE with Qwen2.5-VL substantially improves over vanilla Qwen2.5-VL in the training-free setting, with especially large gains on simulation failure detection, identification, and localization, while remaining competitive with a RoboFAC-tuned baseline. A small QLoRA fine-tune further improves explanation and correction quality. We also report qualitative results on real dual-arm robots, demonstrating the practical applicability of KITE as a structured and interpretable front-end for robot failure analysis. Code and models are released on our project page: https://m80hz.github.io/kite/
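The serialization step described in the abstract can be illustrated as follows. The exact token syntax, field names, and tag format below are invented for this sketch; the paper only states that robot-profile tokens, scene-context tokens, and per-keyframe BEV cues (layout, timestamps, detection confidence) are combined into one unified prompt.

```python
def serialize_evidence(robot_profile, scene_context, keyframes):
    """Assemble a unified text prompt from tokenized evidence.

    Hypothetical serialization: one header line each for robot profile
    and scene context, then one line per keyframe listing detections
    with 2D layout coordinates and confidences, followed by the task
    instruction. KITE's real token format may differ.
    """
    lines = [f"<robot>{robot_profile}</robot>",
             f"<scene>{scene_context}</scene>"]
    for kf in keyframes:
        dets = "; ".join(
            f"{d['label']}@({d['x']:.2f},{d['y']:.2f}) conf={d['conf']:.2f}"
            for d in kf["detections"])
        lines.append(f"<keyframe t={kf['t']:.1f}s>{dets}</keyframe>")
    lines.append("Task: detect, identify, localize, explain, "
                 "and correct any failure.")
    return "\n".join(lines)

prompt = serialize_evidence(
    robot_profile="dual-arm, 7-DoF per arm",
    scene_context="tabletop pick-and-place",
    keyframes=[{"t": 2.4, "detections": [
        {"label": "mug", "x": 0.31, "y": 0.55, "conf": 0.92}]}],
)
print(prompt)
```

Because all five tasks consume the same serialized evidence, only the trailing task instruction needs to change, which is what lets a single front-end drive detection through correction with an off-the-shelf VLM.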