[R] VOID: Video Object and Interaction Deletion (physically-consistent video inpainting)

Reddit r/MachineLearning / 4/3/2026


Key Points

  • VOID is presented as a video inpainting/object-removal model that targets physically consistent outcomes by accounting for how an object influences scene dynamics, not just its appearance.
  • The method uses counterfactual scene evolution, asking “what would the video look like if the object had never been there?” and addresses issues where removing an object should change subsequent events (e.g., domino falls or avoiding a collision).
  • It is trained with counterfactual paired data generated using Kubric and HUMOTO, and it leverages a vision-language model to guide which regions/motions are affected by the removal.
  • VOID applies a two-pass generation strategy—first predicting motion changes and then refining with flow-warped noise to improve temporal consistency.
  • In a human preference study on real-world videos, VOID was chosen 64.8% of the time over several baselines (e.g., Runway/Aleph, Generative Omnimatte, ProPainter), indicating improved plausibility for physically interactive scenes.

We present VOID, a model for video object removal that aims to handle *physical interactions*, not just appearance.

Most existing video inpainting / object removal methods can fill in the pixels behind an object, and some can even remove its shadows and reflections, but they often fail when the removed object affects the dynamics of the scene.

For example:
- A domino chain is falling → removing the middle blocks should stop the chain
- Two cars are about to crash → removing one car should prevent the collision

Current models typically remove the object but leave its effects unchanged, resulting in physically implausible outputs.

VOID addresses this by modeling counterfactual scene evolution:
“What would the video look like if the object had never been there?”
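To make the counterfactual idea concrete, here is a toy illustration (not the paper's Kubric/HUMOTO pipeline, and every name below is hypothetical): simulate a 1-D domino chain twice, once with and once without a middle block, so the two rollouts form a "with object" / "without object" training pair whose later frames genuinely differ.

```python
# Toy counterfactual pair: same scene simulated with and without one object.
import numpy as np

def simulate_chain(present, n_dominoes=5, n_steps=10):
    """Return a (time, domino) boolean array of which dominoes have fallen.

    present[i] is False if domino i was 'removed' from the scene.
    A domino falls one step after its (present) left neighbour has fallen.
    """
    fallen = np.zeros((n_steps, n_dominoes), dtype=bool)
    fallen[0, 0] = present[0]  # domino 0 is pushed at t=0
    for t in range(1, n_steps):
        fallen[t] = fallen[t - 1]
        for i in range(1, n_dominoes):
            if present[i] and present[i - 1] and fallen[t - 1, i - 1]:
                fallen[t, i] = True
    return fallen

with_obj = simulate_chain([True] * 5)                          # full chain
without_obj = simulate_chain([True, True, False, True, True])  # middle removed

print(with_obj[-1])     # chain propagates to the end: all fallen
print(without_obj[-1])  # dominoes past the gap never fall
```

The point of the pair is that naive inpainting of the removed domino would still show the right half of the chain falling, while the counterfactual target keeps it standing.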

Key ideas:
- Counterfactual training data: paired videos with and without objects (generated using Kubric and HUMOTO)
- VLM-guided masks: a vision-language model identifies which regions of the scene are affected by the removal
- Two-pass generation: first predict the new motion, then refine with flow-warped noise for temporal consistency
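One way to read "flow-warped noise" (this is our interpretation, not the official VOID code): instead of sampling i.i.d. noise per frame for the refinement pass, warp the previous frame's noise along optical flow so that corresponding pixels see correlated noise across time. A minimal nearest-neighbour sketch:

```python
# Hedged sketch of flow-warped noise for temporal consistency.
import numpy as np

def warp_noise(prev_noise, flow):
    """Backward-warp a noise map along a flow field (nearest neighbour).

    prev_noise: (H, W) noise for frame t-1
    flow:       (H, W, 2) displacement (dy, dx) from frame t back to t-1
    """
    h, w = prev_noise.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_y = np.clip(np.round(ys + flow[..., 0]).astype(int), 0, h - 1)
    src_x = np.clip(np.round(xs + flow[..., 1]).astype(int), 0, w - 1)
    return prev_noise[src_y, src_x]

rng = np.random.default_rng(0)
noise_t0 = rng.standard_normal((4, 4))

# A constant flow of (0, +1): every pixel in frame t reads one column to the
# right in frame t-1, so the noise pattern shifts left by one column.
flow = np.zeros((4, 4, 2))
flow[..., 1] = 1.0
noise_t1 = warp_noise(noise_t0, flow)

print(noise_t1[:, 0])  # equals noise_t0[:, 1]
```

A production version would use bilinear sampling and renormalize the warped noise, since warping alone breaks the i.i.d. Gaussian statistics diffusion models expect.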

In a human preference study on real-world videos, VOID was selected 64.8% of the time over baselines such as Runway (Aleph), Generative Omnimatte, and ProPainter.
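For readers who want a sense of the uncertainty behind a raw preference rate like 64.8%: the post doesn't state the number of ratings, so the n=500 below is purely hypothetical, but a Wilson score interval gives a quick sanity check.

```python
# 95% Wilson score interval for a binomial proportion (hypothetical n).
import math

def wilson_interval(wins, n, z=1.96):
    p = wins / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

lo, hi = wilson_interval(wins=324, n=500)  # 324/500 = 64.8%
print(f"95% CI: [{lo:.3f}, {hi:.3f}]")
```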

Project page: https://void-model.github.io/
Code: https://github.com/Netflix/void-model
Demo: https://huggingface.co/spaces/sam-motamed/VOID
Paper: https://arxiv.org/abs/2604.02296

Happy to answer questions!

Removing the compressor and saving the duckie.

submitted by /u/Least_Light6037