SoulX-LiveAct: Towards Hour-Scale Real-Time Human Animation with Neighbor Forcing and ConvKV Memory

arXiv cs.CV / 3/13/2026

Key Points

  • The paper identifies two key challenges in hour-scale real-time animation with autoregressive diffusion: mismatched diffusion states in forcing strategies and unbounded historical representations that hinder stable learning and efficient reuse of cached states.
  • It introduces Neighbor Forcing, a diffusion-step-consistent AR formulation that propagates temporally adjacent frames as latent neighbors under the same noise condition to provide a stable, distribution-aligned learning signal.
  • It also presents ConvKV, a memory mechanism that compresses keys and values in causal attention into a fixed-length representation to enable constant-memory inference and truly infinite video generation.
  • Experiments show improved training convergence, hour-scale generation quality, and inference efficiency compared with prior AR diffusion methods.
  • The approach supports 20 FPS real-time streaming inference on as few as two NVIDIA H100 or H200 GPUs, and the authors report state-of-the-art lip-sync accuracy, human animation quality, and emotional expressiveness at the lowest inference cost among the methods compared.
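
To make the "same noise condition" idea in the Neighbor Forcing bullet concrete, here is a minimal sketch of how training pairs might be built. It is an interpretation, not the paper's implementation: the function names `noise_frames` and `neighbor_pairs`, the linear noise blend, and the tensor shapes are all assumptions made for illustration. The point it shows is only the contrast between giving a frame and its temporal neighbor one shared noise level versus sampling a level per frame independently.

```python
import numpy as np

def noise_frames(latents, levels, rng):
    """Blend each frame with Gaussian noise: x = (1 - s) * z + s * eps.

    latents: (num_frames, ...) clean latents; levels: (num_frames,) in [0, 1].
    The linear blend is an assumed schedule, chosen only for simplicity.
    """
    s = levels.reshape(-1, *([1] * (latents.ndim - 1)))
    eps = rng.standard_normal(latents.shape)
    return (1.0 - s) * latents + s * eps

def neighbor_pairs(latents, rng, shared_level=True):
    """Return (context, target, context_levels, target_levels).

    Frame i-1 is the context for frame i. With shared_level=True every frame
    gets the same noise level, so context and target sit at the same diffusion
    state. With shared_level=False each frame draws its own level, which is
    the mismatched case.
    """
    n = latents.shape[0]
    if shared_level:
        levels = np.full(n, rng.uniform())
    else:
        levels = rng.uniform(size=n)
    noisy = noise_frames(latents, levels, rng)
    return noisy[:-1], noisy[1:], levels[:-1], levels[1:]

rng = np.random.default_rng(0)
clean = rng.standard_normal((6, 4, 8))  # 6 frames of hypothetical 4x8 latents
ctx, tgt, s_ctx, s_tgt = neighbor_pairs(clean, rng, shared_level=True)
print(ctx.shape, tgt.shape)        # (5, 4, 8) (5, 4, 8)
print(np.allclose(s_ctx, s_tgt))   # True: neighbors share one noise level
```

In a real training loop a denoiser would take `ctx` as conditioning and predict the clean version of `tgt`; that model and its loss are outside this sketch.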

Abstract

Autoregressive (AR) diffusion models offer a promising framework for sequential generation tasks such as video synthesis by combining diffusion modeling with causal inference. Although they support streaming generation, existing AR diffusion methods struggle to scale efficiently. In this paper, we identify two key challenges in hour-scale real-time human animation. First, most forcing strategies propagate sample-level representations with mismatched diffusion states, causing inconsistent learning signals and unstable convergence. Second, historical representations grow unbounded and lack structure, preventing effective reuse of cached states and severely limiting inference efficiency. To address these challenges, we propose Neighbor Forcing, a diffusion-step-consistent AR formulation that propagates temporally adjacent frames as latent neighbors under the same noise condition. This design provides a distribution-aligned and stable learning signal while preserving drifting throughout the AR chain. Building upon this, we introduce a structured ConvKV memory mechanism that compresses the keys and values in causal attention into a fixed-length representation, enabling constant-memory inference and truly infinite video generation without relying on short-term motion-frame memory. Extensive experiments demonstrate that our approach significantly improves training convergence, hour-scale generation quality, and inference efficiency compared to existing AR diffusion methods. Numerically, LiveAct enables hour-scale real-time human animation and supports 20 FPS real-time streaming inference on as few as two NVIDIA H100 or H200 GPUs. Quantitative results demonstrate that our method attains state-of-the-art performance in lip-sync accuracy, human animation quality, and emotional expressiveness, with the lowest inference cost.
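
The constant-memory claim can be illustrated with a small sketch. The abstract does not specify how ConvKV compresses keys and values, so everything below is an assumption: the class name `FixedLengthKV`, the choice of a uniform (mean) stride-2 convolution, and the policy of folding the oldest entries first are hypothetical stand-ins. The sketch only demonstrates the general property that a cache capped at a fixed length keeps attention cost constant, whereas an uncompressed cache would grow with every frame of an hour-long stream.

```python
import numpy as np

FPS = 20
FRAMES_PER_HOUR = FPS * 3600   # 72000 frames in one hour at 20 FPS
FRAME_BUDGET_MS = 1000 / FPS   # 50.0 ms available per frame for real time

def strided_mean_conv(x, kernel):
    """Uniform-weight 1D convolution with stride == kernel along axis 0.

    x: (T, d) with T divisible by kernel -> (T // kernel, d).
    """
    t, d = x.shape
    return x.reshape(t // kernel, kernel, d).mean(axis=1)

class FixedLengthKV:
    """Hypothetical key/value cache that never exceeds max_len entries."""

    def __init__(self, max_len, dim, kernel=2):
        assert max_len >= kernel >= 2
        self.max_len, self.kernel = max_len, kernel
        self.k = np.zeros((0, dim))
        self.v = np.zeros((0, dim))

    def append(self, k_new, v_new):
        self.k = np.concatenate([self.k, k_new], axis=0)
        self.v = np.concatenate([self.v, v_new], axis=0)
        # Fold the oldest `kernel` entries into one until the cache fits.
        while len(self.k) > self.max_len:
            c = self.kernel
            self.k = np.concatenate([strided_mean_conv(self.k[:c], c), self.k[c:]])
            self.v = np.concatenate([strided_mean_conv(self.v[:c], c), self.v[c:]])

def attend(q, k, v):
    """Plain softmax attention of queries q over the cached keys/values."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return (w / w.sum(axis=-1, keepdims=True)) @ v

rng = np.random.default_rng(0)
mem = FixedLengthKV(max_len=64, dim=16)
for _ in range(2000):  # 2000 simulated frames, 4 tokens each
    mem.append(rng.standard_normal((4, 16)), rng.standard_normal((4, 16)))
print(mem.k.shape)  # (64, 16), regardless of how many frames were streamed
```

With this folding policy the oldest slot becomes a running average of distant history while recent tokens stay uncompressed. A learned convolution, as the name ConvKV suggests, would presumably replace the fixed mean, but that detail is not given in the text above.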