Controllable Egocentric Video Generation via Occlusion-Aware Sparse 3D Hand Joints
arXiv cs.CV / 3/13/2026
💬 Opinion · Models & Research
Key Points
- The paper presents a framework for generating egocentric videos from a single reference frame using sparse 3D hand joints as embodiment-agnostic control signals with clear semantic and geometric structure.
- It introduces an occlusion-aware control module that filters out unreliable signals from hidden joints, and a 3D-based weighting mechanism that handles dynamically occluded target joints during motion propagation.
- The method injects 3D geometric embeddings into the latent space to enforce structural consistency and develops an automated annotation pipeline yielding over one million egocentric video clips with precise hand trajectories, plus a cross-embodiment benchmark.
- Extensive experiments show the approach significantly outperforms state-of-the-art baselines and generalizes well to robotic hands.
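The occlusion-aware weighting mentioned in the key points can be illustrated with a minimal sketch. Note this is an assumption-laden toy, not the paper's implementation: the function name, the depth-test heuristic (a joint counts as occluded when its depth lies behind the rendered surface depth at its projected pixel), and the soft down-weighting constant are all invented for illustration.

```python
import numpy as np

def occlusion_aware_weights(joint_depths, surface_depths, tau=0.05, occluded_w=0.1):
    """Toy occlusion-aware weighting for sparse 3D hand joints.

    joint_depths:   (N,) depth of each joint in the camera frame.
    surface_depths: (N,) rendered scene depth at each joint's projected pixel.
    A joint is treated as occluded if it sits more than `tau` behind the
    surface; occluded joints are softly down-weighted rather than dropped,
    so their control signal still contributes weakly.
    """
    occluded = joint_depths > surface_depths + tau
    w = np.where(occluded, occluded_w, 1.0)
    return w / w.sum()  # normalized per-joint control weights

# Example: joint 0 visible, joint 1 hidden behind the surface.
w = occlusion_aware_weights(np.array([1.0, 2.0]), np.array([1.0, 1.5]))
```

Here the visible joint receives most of the normalized weight, while the occluded joint's contribution is attenuated instead of zeroed out, mirroring the idea of handling (rather than discarding) dynamically occluded joints.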