SoulX-LiveAct: Towards Hour-Scale Real-Time Human Animation with Neighbor Forcing and ConvKV Memory
arXiv cs.CV / 3/13/2026
Key Points
- The paper identifies two key challenges in hour-scale real-time animation with autoregressive diffusion: mismatched diffusion states in forcing strategies and unbounded historical representations that hinder stable learning and efficient reuse of cached states.
- It introduces Neighbor Forcing, a diffusion-step-consistent AR formulation that propagates temporally adjacent frames as latent neighbors under the same noise condition to provide a stable, distribution-aligned learning signal.
- It also presents ConvKV, a memory mechanism that compresses the keys and values in causal attention into a fixed-length representation, enabling constant-memory inference and video generation of unbounded length.
- Experiments show improved training convergence, hour-scale generation quality, and inference efficiency compared with prior AR diffusion methods.
- The approach enables 20 FPS real-time streaming inference on as few as two NVIDIA H100 or H200 GPUs and achieves state-of-the-art lip-sync accuracy, animation quality, and expressive realism with the lowest inference cost.
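
To make the fixed-length memory idea behind ConvKV more concrete, here is a minimal NumPy sketch. It is not the paper's implementation: the summary does not describe how ConvKV is built, so the class name `FixedLengthKVCache`, the helper `conv_compress`, the uniform averaging kernel (a stand-in for what would presumably be learned weights), and the policy of compressing the oldest entries first are all assumptions chosen for illustration. The only property it aims to show is that the key/value cache stays bounded no matter how many frames are appended.

```python
import numpy as np

def conv_compress(x, weights):
    """Non-overlapping strided 1D convolution along the time axis.

    x: (T, D) with T divisible by len(weights); returns (T / k, D).
    """
    k = len(weights)
    groups = x.reshape(-1, k, x.shape[1])           # (T/k, k, D)
    return np.einsum("k,gkd->gd", weights, groups)  # weighted sum per group

class FixedLengthKVCache:
    """Hypothetical bounded KV cache: oldest entries are merged when full."""

    def __init__(self, max_len, kernel_size=2):
        assert max_len >= kernel_size >= 2
        self.max_len = max_len
        # Assumption: a fixed averaging kernel instead of learned weights.
        self.weights = np.full(kernel_size, 1.0 / kernel_size)
        self.keys = None
        self.values = None

    def append(self, new_k, new_v):
        if self.keys is None:
            self.keys, self.values = new_k, new_v
        else:
            self.keys = np.concatenate([self.keys, new_k])
            self.values = np.concatenate([self.values, new_v])
        k = len(self.weights)
        # Merge the oldest entries until the cache fits the budget again.
        while self.keys.shape[0] > self.max_len:
            over = self.keys.shape[0] - self.max_len
            # Each merged group of k entries shrinks the cache by k - 1.
            g = min(-(-over // (k - 1)), self.keys.shape[0] // k)
            n = g * k
            self.keys = np.concatenate(
                [conv_compress(self.keys[:n], self.weights), self.keys[n:]])
            self.values = np.concatenate(
                [conv_compress(self.values[:n], self.weights), self.values[n:]])

    def attend(self, q):
        """Single-query softmax attention over the (bounded) cache."""
        scores = self.keys @ q / np.sqrt(q.shape[0])
        w = np.exp(scores - scores.max())
        w /= w.sum()
        return w @ self.values

# Usage: stream 1000 chunks of 3 tokens each; memory never exceeds max_len.
rng = np.random.default_rng(0)
cache = FixedLengthKVCache(max_len=8)
for _ in range(1000):
    cache.append(rng.normal(size=(3, 16)), rng.normal(size=(3, 16)))
out = cache.attend(rng.normal(size=16))
```

In this sketch, recent tokens are kept at full resolution while older ones are progressively averaged together, so attention cost per step depends on `max_len` rather than on the length of the video generated so far.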




