KITE: Keyframe-Indexed Tokenized Evidence for VLM-Based Robot Failure Analysis
arXiv cs.RO / 4/9/2026
Key Points
- KITE is a training-free, keyframe-anchored visual front-end that turns long robot-execution videos into compact, interpretable tokenized evidence for vision-language models (VLMs).
- It summarizes trajectories into motion-salient keyframes paired with bird’s-eye-view (BEV) schematics capturing relative object layout, axes, timestamps, and detection confidence, then serializes these cues into a unified prompt with robot- and scene-context tokens.
- The same KITE prompt structure supports multiple robot-failure-analysis tasks, including failure detection, identification, localization, explanation, and correction using an off-the-shelf VLM.
- On the RoboFAC benchmark, KITE with Qwen2.5-VL significantly outperforms vanilla Qwen2.5-VL in the training-free setting, with the biggest improvements in simulation failure detection, identification, and localization.
- A small QLoRA fine-tune further boosts explanation and correction quality, and qualitative tests on real dual-arm robots suggest practical applicability, with code and models released.
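The pipeline the bullets describe has two stages: pick motion-salient keyframes from the execution video, then serialize per-keyframe cues (timestamp, BEV object layout, detection confidence) plus robot/scene context into a single textual prompt. The paper's actual keyframe criterion and prompt schema are not given here; the sketch below is a minimal illustration that uses inter-frame pixel difference as a stand-in motion score and an invented `[KF …]` token format:

```python
import numpy as np

def select_keyframes(frames, num_keyframes=4):
    """Pick motion-salient keyframes by ranking frames on mean
    inter-frame pixel difference (a simple proxy for motion salience;
    KITE's actual criterion may differ)."""
    diffs = [np.abs(frames[i].astype(float) - frames[i - 1].astype(float)).mean()
             for i in range(1, len(frames))]
    # Frame indices with the largest motion, returned in temporal order.
    top = sorted(np.argsort(diffs)[-num_keyframes:] + 1)
    return [int(i) for i in top]

def serialize_evidence(keyframes, scene="tabletop manipulation", robot="dual-arm"):
    """Serialize per-keyframe cues into a compact textual prompt.
    The token names ([SCENE], [ROBOT], [KF]) are hypothetical."""
    lines = [f"[SCENE] {scene}", f"[ROBOT] {robot}"]
    for kf in keyframes:
        lines.append(f"[KF t={kf['t']:.1f}s] layout={kf['bev']} conf={kf['conf']:.2f}")
    return "\n".join(lines)
```

The resulting string would be prepended to the task question (detection, localization, etc.) and passed to the VLM alongside the selected keyframe images, so one prompt structure serves all five failure-analysis tasks.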