
OSCBench: Benchmarking Object State Change in Text-to-Video Generation

arXiv cs.CL / 3/13/2026

📰 News · Ideas & Deep Analysis · Models & Research

Key Points

  • OSCBench introduces a specialized benchmark to evaluate object state change (OSC) understanding in text-to-video models, addressing a gap not covered by existing benchmarks.
  • Built from instructional cooking data, OSCBench organizes action-object interactions into regular, novel, and compositional scenarios to probe both in-distribution performance and generalization (a rough illustrative sketch of this organization follows the list).
  • The authors evaluate six representative open-source and proprietary T2V models using human user studies and multimodal LLM-based automatic evaluation, revealing strong performance on semantic and scene alignment but persistent difficulty with OSC.
  • The study positions OSC as a key bottleneck for state-aware video generation and establishes OSCBench as a diagnostic tool to guide future model improvements.
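
The paper's exact data schema is not given in this summary, but a minimal sketch of how action-object prompts could be grouped into the three scenario types might look like the following. All field names and example prompts here are illustrative assumptions, not taken from OSCBench itself.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class OSCPrompt:
    """One action-object prompt with its expected object state change."""
    action: str      # e.g. "peel"
    obj: str         # e.g. "potato"
    scenario: str    # "regular", "novel", or "compositional"
    prompt: str      # full text prompt given to the T2V model
    end_state: str   # expected final object state, e.g. "peeled"

# Hypothetical examples in the spirit of the three scenario types.
bench: List[OSCPrompt] = [
    OSCPrompt("peel", "potato", "regular",
              "A person peels a potato on a cutting board.", "peeled"),
    OSCPrompt("slice", "candle", "novel",
              "A person slices a candle with a kitchen knife.", "sliced"),
    OSCPrompt("peel then slice", "lemon", "compositional",
              "A person peels a lemon and then slices it.", "peeled and sliced"),
]

# Group prompts by scenario so in-distribution (regular) and generalization
# (novel, compositional) performance can be reported separately.
by_scenario: Dict[str, List[OSCPrompt]] = {}
for p in bench:
    by_scenario.setdefault(p.scenario, []).append(p)
```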

Abstract

Text-to-video (T2V) generation models have made rapid progress in producing visually high-quality and temporally coherent videos. However, existing benchmarks primarily focus on perceptual quality, text-video alignment, or physical plausibility, leaving a critical aspect of action understanding largely unexplored: object state change (OSC) explicitly specified in the text prompt. OSC refers to the transformation of an object's state induced by an action, such as peeling a potato or slicing a lemon. In this paper, we introduce OSCBench, a benchmark specifically designed to assess OSC performance in T2V models. OSCBench is constructed from instructional cooking data and systematically organizes action-object interactions into regular, novel, and compositional scenarios to probe both in-distribution performance and generalization. We evaluate six representative open-source and proprietary T2V models using both human user studies and multimodal large language model (MLLM)-based automatic evaluation. Our results show that, despite strong performance on semantic and scene alignment, current T2V models consistently struggle with accurate and temporally consistent object state changes, especially in novel and compositional settings. These findings position OSC as a key bottleneck in text-to-video generation and establish OSCBench as a diagnostic benchmark for advancing state-aware video generation models.
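
The abstract does not spell out the MLLM-based evaluation protocol, so the sketch below is only a plausible shape for automatic OSC scoring: sample frames from the generated video and ask a multimodal LLM whether the specified state change is visible by the end. Here `query_mllm` is a hypothetical placeholder for whatever vision-language API is used, and the yes/no prompt wording and frame-sampling assumption are illustrative, not the authors' actual criteria.

```python
def osc_score(frames, action: str, obj: str, end_state: str, query_mllm) -> float:
    """Return 1.0 if the MLLM judges the object state change as achieved, else 0.0.

    `frames` is a list of sampled video frames; `query_mllm` is a caller-supplied
    function that sends frames plus a question to a multimodal LLM and returns text.
    """
    question = (
        f"These frames show a video of someone performing '{action}' on a {obj}. "
        f"By the final frame, is the {obj} clearly {end_state}? Answer yes or no."
    )
    answer = query_mllm(frames=frames, question=question)
    return 1.0 if answer.strip().lower().startswith("yes") else 0.0
```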