Building a Precise Video Language with Human-AI Oversight

arXiv cs.CV / 4/24/2026


Key Points

  • The paper presents a structured specification for video-language modeling covering subjects, scenes, motion, and spatial/camera dynamics, grounded in hundreds of carefully designed visual primitives developed with professional video creators.
  • It introduces CHAI (Critique-based Human-AI Oversight), where trained human experts critique and revise model-generated “pre-captions” into improved “post-captions,” improving annotation accuracy and efficiency by letting models handle text generation.
  • The critique process itself is used as supervision to improve open-source VLMs (including Qwen3-VL) via techniques such as SFT, DPO, and inference-time scaling, with ablations showing oversight critique quality drives downstream gains.
  • Experiments report that models trained with modest expert oversight can outperform closed-source captioning systems like Gemini-3.1-Pro. The approach is also applied to re-caption large-scale professional videos and to fine-tune video generation models (e.g., Wan) to follow detailed, cinematography-aware prompts of up to 400 words.
  • The authors release datasets, benchmarks, and recipes along with data/code on the project page, aiming to make scalable oversight and precise video captioning more accessible to the research community.
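The critique workflow described above naturally yields preference data: the expert-revised post-caption is preferred over the model's pre-caption. A minimal sketch of how such records might be converted into DPO-style preference pairs is shown below; the field names and prompt format are illustrative assumptions, not the paper's actual schema.

```python
# Hypothetical sketch: turning pre-/post-caption records from a
# critique workflow into DPO-style preference pairs.
# All field names here are illustrative, not the paper's schema.
from dataclasses import dataclass


@dataclass
class CaptionRecord:
    video_id: str
    pre_caption: str       # model-generated draft caption
    critique: str          # expert critique of the draft
    post_caption: str      # expert-revised caption


def to_dpo_pair(rec: CaptionRecord) -> dict:
    """The revised post-caption is 'chosen'; the draft is 'rejected'."""
    return {
        "prompt": f"Describe video {rec.video_id} precisely.",
        "chosen": rec.post_caption,
        "rejected": rec.pre_caption,
    }


record = CaptionRecord(
    video_id="clip_001",
    pre_caption="A person walks.",
    critique="Missing camera motion and scene detail.",
    post_caption=(
        "A woman walks left to right across a rain-slicked street; "
        "the camera tracks her in a slow dolly shot."
    ),
)
pair = to_dpo_pair(record)
```

The same records could also supervise a critique generator (predict `critique` from `pre_caption`) or a reward model (score `chosen` above `rejected`), matching the three uses the paper lists.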

Abstract

Video-language models (VLMs) learn to reason about the dynamic visual world through natural language. We introduce a suite of open datasets, benchmarks, and recipes for scalable oversight that enable precise video captioning. First, we define a structured specification for describing subjects, scenes, motion, spatial, and camera dynamics, grounded by hundreds of carefully defined visual primitives developed with professional video creators such as filmmakers. Next, to curate high-quality captions, we introduce CHAI (Critique-based Human-AI Oversight), a framework where trained experts critique and revise model-generated pre-captions into improved post-captions. This division of labor improves annotation accuracy and efficiency by offloading text generation to models, allowing humans to better focus on verification. Additionally, these critiques and preferences between pre- and post-captions provide rich supervision for improving open-source models (Qwen3-VL) on caption generation, reward modeling, and critique generation through SFT, DPO, and inference-time scaling. Our ablations show that critique quality in precision, recall, and constructiveness, ensured by our oversight framework, directly governs downstream performance. With modest expert supervision, the resulting model outperforms closed-source models such as Gemini-3.1-Pro. Finally, we apply our approach to re-caption large-scale professional videos (e.g., films, commercials, games) and fine-tune video generation models such as Wan to better follow detailed prompts of up to 400 words, achieving finer control over cinematography including camera motion, angle, lens, focus, point of view, and framing. Our results show that precise specification and human-AI oversight are key to professional-level video understanding and generation. Data and code are available on our project page: https://linzhiqiu.github.io/papers/chai/
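The abstract frames critique quality in terms of precision and recall. One simple way to operationalize this is to treat a critique as a set of flagged errors and compare it against a gold error set; the sketch below is an illustrative metric under that assumption, not the paper's exact definition.

```python
# Hedged sketch: scoring a critique's precision and recall against a
# gold set of annotated errors. Treats each critique as a set of
# flagged error labels; this is an illustrative metric, not the
# paper's exact formulation.
def critique_precision_recall(flagged, gold):
    """Precision: fraction of flagged errors that are real.
    Recall: fraction of real errors that were flagged."""
    flagged, gold = set(flagged), set(gold)
    tp = len(flagged & gold)  # true positives: correctly flagged errors
    precision = tp / len(flagged) if flagged else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall


p, r = critique_precision_recall(
    flagged={"wrong_subject", "missing_camera_motion"},
    gold={"missing_camera_motion", "wrong_lens"},
)
```

Here one of the two flagged errors is real and one of the two real errors was caught, so both precision and recall are 0.5; "constructiveness" (whether the critique suggests a usable fix) would need a separate, likely human, judgment.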