PRISM: A Multi-View Multi-Capability Retail Video Dataset for Embodied Vision-Language Models

arXiv cs.CV / 4/1/2026


Key Points

  • PRISM is a new, 270K-sample multi-view retail video supervised fine-tuning dataset designed for embodied vision-language models in real-world supermarket settings.
  • The dataset is built on a 3D knowledge ontology covering spatial knowledge, temporal/physical knowledge, and embodied action knowledge, enabling evaluation across 20+ capability probes.
  • PRISM includes diverse viewpoints (egocentric, exocentric, and 360°) from five supermarket locations, with multiple supervision formats such as open-ended, chain-of-thought, and multiple-choice.
  • Fine-tuning embodied VLMs on PRISM reduces the error rate across all 20+ probes by 66.6% relative to the pre-trained baseline, with the largest improvement in embodied action understanding, where accuracy rises by 36.4%.
  • The authors position PRISM as one of the largest domain-specific video SFT corpora (about 11.8M frames and ~730M tokens) and release the dataset at dreamvu.ai/prism.

Abstract

A critical gap exists between the general-purpose visual understanding of state-of-the-art physical AI models and the specialized perceptual demands of structured real-world deployment environments. We present PRISM, a 270K-sample multi-view video supervised fine-tuning (SFT) corpus for embodied vision-language models (VLMs) in real-world retail environments. PRISM is motivated by a simple observation: physical AI systems fail not because of poor visual recognition, but because they do not understand space, physical dynamics, and embodied action well enough to operate reliably in the world. To this end, PRISM is grounded in a novel three-dimensional knowledge ontology spanning spatial knowledge, temporal and physical knowledge, and embodied action knowledge. It covers 20+ capability probes across four evaluation dimensions: Embodied Reasoning (ER), Common Sense (CS), Spatial Perception (SP), and Intuitive Physics (IP). To our knowledge, PRISM is the first dataset to instantiate all three knowledge dimensions within a single real-world deployment domain. The corpus captures data from egocentric, exocentric, and 360° viewpoints across five supermarket locations and includes open-ended, chain-of-thought, and multiple-choice supervision. At 4 fps, PRISM spans approximately 11.8M video frames and approximately 730M tokens, placing it among the largest domain-specific video SFT corpora. Fine-tuning on PRISM reduces the error rate across all 20+ probes by 66.6% over the pre-trained baseline, with significant gains in embodied action understanding, where accuracy improves by 36.4%. Our results suggest that ontology-structured, domain-specific SFT can meaningfully strengthen embodied VLMs for real-world settings. The PRISM dataset and more details are available at https://dreamvu.ai/prism.
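As a rough sanity check on the scale figures quoted above, the reported frame count, frame rate, and token count imply on the order of 820 hours of footage and roughly 2,700 tokens per SFT sample. A minimal back-of-envelope sketch (the baseline error rate below is a hypothetical value chosen only to illustrate what a 66.6% error-rate reduction means; it is not reported in the paper):

```python
# Back-of-envelope estimates derived from the figures quoted in the abstract.
# Inputs are the paper's reported numbers; everything else is simple arithmetic.

FRAMES = 11_800_000      # ~11.8M video frames
FPS = 4                  # sampling rate stated in the abstract
TOKENS = 730_000_000     # ~730M tokens
SAMPLES = 270_000        # 270K SFT samples

video_seconds = FRAMES / FPS
video_hours = video_seconds / 3600          # implied total footage duration
tokens_per_sample = TOKENS / SAMPLES        # implied average supervision length

# A 66.6% error-rate reduction means the fine-tuned model keeps only
# about one third of the baseline's errors.
baseline_error = 0.30                       # hypothetical, for illustration only
finetuned_error = baseline_error * (1 - 0.666)

print(f"~{video_hours:.0f} h of video, ~{tokens_per_sample:.0f} tokens/sample, "
      f"error {baseline_error:.2f} -> {finetuned_error:.3f}")
```

Note that the 66.6% figure is a relative reduction in error rate, not an absolute accuracy gain, which is why it can coexist with the smaller +36.4% accuracy improvement on the embodied-action probes.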