Structured Causal Video Reasoning via Multi-Objective Alignment
arXiv cs.CL / 4/7/2026
Key Points
- The paper argues that existing Video-LLMs often perform inefficient and fragile causal inference because they rely on unstructured text rather than a structured mental model of entities, actions, and temporal relations.
- It proposes a compact structured prior called Structured Event Facts that captures salient events and explicit causal relationships before the main reasoning stage.
- To train models on these structured facts, the authors introduce the CausalFact-60K dataset and a four-stage pipeline (facts alignment, format warm-start, thinking warm-start, and RL-based post-training).
- During reinforcement learning, the work frames its competing goals—structural completeness, causal fidelity, and reasoning length—as a Multi-Objective RL (MORL) problem, optimizing toward the Pareto frontier rather than collapsing the objectives into a single fixed score.
- The resulting model, Factum-4B, is reported to produce more reliable reasoning and improved performance on video understanding benchmarks that require fine-grained temporal causal inference.
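The MORL framing described above can be made concrete with a small sketch. The paper's actual reward design is not detailed here, so the function names, the linear scalarization, and the dominance check below are illustrative assumptions: a weighted sum of the three rewards picks out one point on the Pareto frontier per weight vector, and a dominance filter keeps only the non-dominated candidates.

```python
# Hypothetical sketch of the multi-objective trade-off: combine the three
# competing rewards (structural completeness, causal fidelity, reasoning
# length) into one RL signal, and filter candidates to the Pareto frontier.
# All names and weights are illustrative, not the authors' implementation.

def scalarize_rewards(completeness: float, fidelity: float,
                      length_penalty: float,
                      weights: tuple[float, float, float]) -> float:
    """Linear scalarization: each weight vector selects one Pareto point."""
    w_c, w_f, w_l = weights
    # Length is a cost, so it is subtracted rather than added.
    return w_c * completeness + w_f * fidelity - w_l * length_penalty

def pareto_front(candidates: list[tuple[float, float, float]]
                 ) -> list[tuple[float, float, float]]:
    """Keep candidates not dominated on (completeness, fidelity, -length).

    A candidate is dominated if some other candidate is at least as good
    on every objective and strictly better on at least one.
    """
    front = []
    for i, a in enumerate(candidates):
        dominated = any(
            all(x >= y for x, y in zip(b, a)) and b != a
            for j, b in enumerate(candidates) if j != i
        )
        if not dominated:
            front.append(a)
    return front
```

Linear scalarization is the simplest way to sweep the frontier; in practice MORL methods may instead adapt the weights during training or use non-linear (e.g. Chebyshev) scalarization to reach non-convex regions of the frontier.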