CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation
arXiv cs.CV / 4/22/2026
Key Points
- CoInteract introduces an end-to-end framework for synthesizing human–object interaction (HOI) videos, conditioned on a person reference image, a product reference image, a text prompt, and speech audio.
- The paper targets two common failure modes of diffusion-based HOI video generation: unstable fine structures (hands/faces) and physically implausible contacts such as hand–object interpenetration.
- It proposes a Human-Aware Mixture-of-Experts (MoE) with spatially supervised token routing to route image regions to specialized experts, improving structural fidelity without large parameter increases.
- It also introduces Spatially-Structured Co-Generation: a dual-stream training setup in which an auxiliary HOI-structure stream injects interaction-geometry priors into the main stream. The HOI branch is removed at inference, so it adds zero runtime overhead.
- Experiments report significant improvements over prior methods in structural stability, logical consistency, and interaction realism.
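The spatially supervised token routing in the Human-Aware MoE can be illustrated with a small sketch. This is an assumption-laden toy, not the paper's implementation: experts are reduced to single linear maps, routing is hard top-1, and names like `region_labels` (per-token region ids derived from hand/face/background masks) are illustrative.

```python
import numpy as np

# Toy sketch: route each image token to one of E region experts,
# and supervise the router with spatial region labels.
rng = np.random.default_rng(0)
N, D, E = 16, 8, 3                              # tokens, channels, experts
tokens = rng.normal(size=(N, D))
experts_w = rng.normal(size=(E, D, D)) * 0.1    # one tiny linear "expert" per region
router_logits = rng.normal(size=(N, E))         # router score per token per expert
region_labels = rng.integers(0, E, size=N)      # region id per token (from masks)

def route_tokens(tokens, router_logits, experts_w):
    """Hard top-1 routing: send each token to its argmax expert."""
    assign = np.argmax(router_logits, axis=-1)
    out = np.zeros_like(tokens)
    for e in range(experts_w.shape[0]):
        idx = np.where(assign == e)[0]
        if idx.size:
            out[idx] = tokens[idx] @ experts_w[e]
    return out, assign

def routing_loss(router_logits, region_labels):
    """Cross-entropy pushing router scores toward the spatial region labels —
    the 'spatially supervised' part of the routing."""
    z = router_logits - router_logits.max(axis=-1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return -np.log(p[np.arange(len(region_labels)), region_labels]).mean()

out, assign = route_tokens(tokens, router_logits, experts_w)
loss = routing_loss(router_logits, region_labels)
```

Because only one small expert runs per token, the per-token compute stays close to a single shared FFN, which matches the claim of better structural fidelity without a large parameter or compute increase.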

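The train-time-only auxiliary stream can be sketched in a few lines. Again a hypothetical simplification: both streams are single linear maps, and the "geometry" input stands in for whatever interaction-structure representation (e.g. contact or pose maps) the auxiliary stream actually encodes.

```python
import numpy as np

# Dual-stream sketch: an auxiliary HOI-structure stream adds geometry
# features during training; at inference only the main stream runs.
rng = np.random.default_rng(1)
D = 8
w_main = rng.normal(size=(D, D)) * 0.1
w_hoi = rng.normal(size=(D, D)) * 0.1           # auxiliary-branch weights

def main_stream(x, aux_feat=None):
    h = x @ w_main
    if aux_feat is not None:                     # training-time prior injection
        h = h + aux_feat
    return h

def hoi_structure_stream(geometry):
    """Encode interaction geometry into injectable features (toy)."""
    return geometry @ w_hoi

x = rng.normal(size=(4, D))
geometry = rng.normal(size=(4, D))

train_out = main_stream(x, hoi_structure_stream(geometry))   # both streams active
infer_out = main_stream(x)                                   # HOI branch dropped
```

Since the auxiliary branch is only ever added to the main stream's features, deleting it at inference leaves the main path untouched, which is why the method can claim zero extra inference overhead.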