Sculpt4D: Generating 4D Shapes via Sparse-Attention Diffusion Transformers

arXiv cs.CV · April 24, 2026


Key Points

  • The paper introduces Sculpt4D, a native 4D generative framework aimed at producing high-fidelity dynamic 4D shapes (3D geometry evolving over time), an area still limited by temporal artifacts and high compute costs.
  • Sculpt4D builds on a pretrained 3D Diffusion Transformer (Hunyuan3D 2.1) by adding efficient temporal modeling to reduce reliance on scarce 4D training data.
  • A Block Sparse Attention mechanism anchors generation to the initial frame to preserve object identity, while using a time-decaying sparse mask to capture motion dynamics.
  • The approach avoids the quadratic cost of full attention and reduces total network computation by 56%, achieving state-of-the-art results for temporally coherent 4D synthesis.
  • Overall, Sculpt4D provides a computationally efficient path toward scalable, higher-quality 4D generation.
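To make the attention pattern described above concrete, here is a minimal sketch of a frame-level block sparse attention mask. The anchoring to frame 0 follows the key points; the local temporal window, its size, and the function and parameter names (`block_sparse_mask`, `window`, `tokens_per_frame`) are illustrative assumptions, not the paper's exact time-decaying rule.

```python
import numpy as np

def block_sparse_mask(num_frames, tokens_per_frame, window=2):
    """Illustrative block sparse attention mask (hypothetical parameters;
    the paper's exact time-decaying masking rule is not reproduced here).

    Each query frame attends to:
      * frame 0 (the anchor frame, preserving object identity), and
      * frames within a local temporal window (capturing motion dynamics).
    All other frame pairs are masked out, so cost grows with the number of
    allowed blocks rather than quadratically in the full token sequence.
    """
    n = num_frames * tokens_per_frame
    mask = np.zeros((n, n), dtype=bool)
    for q in range(num_frames):
        # Anchor block: every frame attends back to frame 0.
        allowed = {0}
        # Local neighbourhood: frames within `window` steps of the query.
        allowed.update(range(max(0, q - window),
                             min(num_frames, q + window + 1)))
        for k in allowed:
            mask[q * tokens_per_frame:(q + 1) * tokens_per_frame,
                 k * tokens_per_frame:(k + 1) * tokens_per_frame] = True
    return mask

mask = block_sparse_mask(num_frames=8, tokens_per_frame=4, window=2)
density = mask.mean()  # fraction of query-key pairs actually computed
```

A boolean mask like this can be passed to a standard masked-attention kernel; the block structure is what makes a hardware-efficient sparse implementation possible.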

Abstract

Recent breakthroughs in 3D generative modeling have yielded remarkable progress in static shape synthesis, yet high-fidelity dynamic 4D generation remains elusive, hindered by temporal artifacts and prohibitive computational demands. We present Sculpt4D, a native 4D generative framework that seamlessly integrates efficient temporal modeling into a pretrained 3D Diffusion Transformer (Hunyuan3D 2.1), thereby mitigating the scarcity of 4D training data. At its core lies a Block Sparse Attention mechanism that preserves object identity by anchoring to the initial frame while capturing rich motion dynamics via a time-decaying sparse mask. This design faithfully models complex spatiotemporal dependencies while sidestepping the quadratic overhead of full attention, reducing the network's total computation by 56%. Consequently, Sculpt4D establishes a new state-of-the-art in temporally coherent 4D synthesis and charts a path toward efficient and scalable 4D generation.
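As a rough illustration of the quadratic-overhead argument, the arithmetic below compares the number of query-key score computations under full attention versus a frame-block sparse pattern. The token counts and the anchor-plus-±2-frame pattern are made-up assumptions for illustration; they are not the paper's configuration and are not expected to reproduce the reported 56% figure.

```python
# Hypothetical sizes, for illustration only (not the paper's setup).
frames, tokens_per_frame = 16, 1024
n = frames * tokens_per_frame

# Full spatiotemporal attention: every token scores against every token.
full_pairs = n ** 2

# Assumed sparse pattern: each frame attends to the anchor frame (0)
# plus a +/-2-frame local window, clipped at the sequence ends.
sparse_pairs = 0
for q in range(frames):
    allowed = {0} | set(range(max(0, q - 2), min(frames, q + 3)))
    sparse_pairs += len(allowed) * tokens_per_frame ** 2

ratio = sparse_pairs / full_pairs  # fraction of full-attention work kept
```

Under these assumptions the sparse pattern computes roughly a third of the full-attention score pairs, which shows how block sparsity converts the quadratic cost in total sequence length into a cost linear in the number of allowed frame blocks.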