VIGOR: VIdeo Geometry-Oriented Reward for Temporal Generative Alignment
arXiv cs.CV / 3/18/2026
Key Points
- The paper notes that video diffusion models lack explicit geometric supervision during training, causing artifacts such as object deformation, spatial drift, and depth violations in generated videos.
- It introduces a geometry-based reward that leverages pretrained geometric foundation models to evaluate multi-view consistency via cross-frame reprojection error, computed pointwise in 3D, which is more robust than pixel-space comparisons.
- It proposes a geometry-aware sampling strategy that filters out low-texture and non-semantic regions to focus evaluation on geometrically meaningful areas with reliable correspondences.
- The reward enables two alignment pathways: post-training a bidirectional model via supervised fine-tuning (SFT) or reinforcement learning, and inference-time optimization of a causal video model through test-time scaling, yielding practical gains without extensive retraining.
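To make the reward concrete, here is a minimal sketch of a pointwise cross-frame reprojection error of the kind the key points describe. This is an illustration, not the paper's implementation: the depth maps, intrinsics `K`, relative pose `T_ij`, and `texture_mask` are all assumed inputs (in VIGOR they would come from pretrained geometric foundation models and the geometry-aware sampling step), and all function names are hypothetical.

```python
import numpy as np

def unproject(depth, K):
    """Lift a depth map (H, W) to 3D camera-frame points (H, W, 3)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T          # normalized camera rays
    return rays * depth[..., None]           # scale rays by per-pixel depth

def pointwise_reprojection_error(depth_i, depth_j, K, T_ij, texture_mask):
    """Mean 3D distance between frame-i points carried into frame j's camera
    and the points frame j's own depth predicts at the landing pixels.

    T_ij: 4x4 rigid transform from camera i to camera j (assumed known).
    texture_mask: boolean map keeping geometrically meaningful regions,
    standing in for the paper's geometry-aware sampling strategy.
    """
    pts_i = unproject(depth_i, K)                              # (H, W, 3) in cam i
    pts_h = np.concatenate([pts_i, np.ones_like(pts_i[..., :1])], axis=-1)
    pts_in_j = (pts_h @ T_ij.T)[..., :3]                       # same points in cam j
    proj = pts_in_j @ K.T
    uv = proj[..., :2] / proj[..., 2:3]                        # landing pixels in frame j
    H, W = depth_i.shape
    u = np.clip(np.round(uv[..., 0]).astype(int), 0, W - 1)
    v = np.clip(np.round(uv[..., 1]).astype(int), 0, H - 1)
    pts_j = unproject(depth_j, K)[v, u]                        # frame j's own 3D estimate
    err = np.linalg.norm(pts_in_j - pts_j, axis=-1)            # pointwise 3D error
    valid = texture_mask & (pts_in_j[..., 2] > 0)              # drop masked / behind-camera
    return err[valid].mean()
```

For two geometrically consistent frames (here, identical depth and an identity relative pose) the error is near zero; a reward would be some decreasing function of this error, averaged over sampled frame pairs.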