Sculpt4D: Generating 4D Shapes via Sparse-Attention Diffusion Transformers

arXiv cs.CV · April 24, 2026


Key Points

  • The paper introduces Sculpt4D, a native 4D generative framework aimed at producing high-fidelity dynamic 4D shapes (3D geometry evolving over time), an area still limited by temporal artifacts and high compute costs.
  • Sculpt4D builds on a pretrained 3D Diffusion Transformer (Hunyuan3D 2.1) by adding efficient temporal modeling to reduce reliance on scarce 4D training data.
  • A Block Sparse Attention mechanism anchors generation to the initial frame to preserve object identity, while using a time-decaying sparse mask to capture motion dynamics.
  • The approach avoids the quadratic cost of full attention and reduces total network computation by 56%, achieving state-of-the-art results for temporally coherent 4D synthesis.
  • Overall, Sculpt4D provides a computationally efficient path toward scalable, higher-quality 4D generation.
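To make the attention pattern described above concrete, here is a minimal sketch of a frame-level block sparse attention mask. The anchoring to frame 0 follows the key points; the local temporal window, its size, and the function and parameter names (`block_sparse_mask`, `window`, `tokens_per_frame`) are illustrative assumptions, not the paper's exact time-decaying rule.

```python
import numpy as np

def block_sparse_mask(num_frames, tokens_per_frame, window=2):
    """Illustrative block sparse attention mask (hypothetical parameters;
    the paper's exact time-decaying masking rule is not reproduced here).

    Each query frame attends to:
      * frame 0 (the anchor frame, preserving object identity), and
      * frames within a local temporal window (capturing motion dynamics).
    All other frame pairs are masked out, so cost grows with the number of
    allowed blocks rather than quadratically in the full token sequence.
    """
    n = num_frames * tokens_per_frame
    mask = np.zeros((n, n), dtype=bool)
    for q in range(num_frames):
        # Anchor block: every frame attends back to frame 0.
        allowed = {0}
        # Local neighbourhood: frames within `window` steps of the query.
        allowed.update(range(max(0, q - window),
                             min(num_frames, q + window + 1)))
        for k in allowed:
            mask[q * tokens_per_frame:(q + 1) * tokens_per_frame,
                 k * tokens_per_frame:(k + 1) * tokens_per_frame] = True
    return mask

mask = block_sparse_mask(num_frames=8, tokens_per_frame=4, window=2)
density = mask.mean()  # fraction of query-key pairs actually computed
```

A boolean mask like this can be passed to a standard masked-attention kernel; the block structure is what makes a hardware-efficient sparse implementation possible.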

Abstract

Recent breakthroughs in 3D generative modeling have yielded remarkable progress in static shape synthesis, yet high-fidelity dynamic 4D generation remains elusive, hindered by temporal artifacts and prohibitive computational demands. We present Sculpt4D, a native 4D generative framework that seamlessly integrates efficient temporal modeling into a pretrained 3D Diffusion Transformer (Hunyuan3D 2.1), thereby mitigating the scarcity of 4D training data. At its core lies a Block Sparse Attention mechanism that preserves object identity by anchoring to the initial frame while capturing rich motion dynamics via a time-decaying sparse mask. This design faithfully models complex spatiotemporal dependencies while sidestepping the quadratic overhead of full attention, reducing the network's total computation by 56%. Consequently, Sculpt4D establishes a new state-of-the-art in temporally coherent 4D synthesis and charts a path toward efficient and scalable 4D generation.
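As a rough illustration of the quadratic-overhead argument, the arithmetic below compares the number of query-key score computations under full attention versus a frame-block sparse pattern. The token counts and the anchor-plus-±2-frame pattern are made-up assumptions for illustration; they are not the paper's configuration and are not expected to reproduce the reported 56% figure.

```python
# Hypothetical sizes, for illustration only (not the paper's setup).
frames, tokens_per_frame = 16, 1024
n = frames * tokens_per_frame

# Full spatiotemporal attention: every token scores against every token.
full_pairs = n ** 2

# Assumed sparse pattern: each frame attends to the anchor frame (0)
# plus a +/-2-frame local window, clipped at the sequence ends.
sparse_pairs = 0
for q in range(frames):
    allowed = {0} | set(range(max(0, q - 2), min(frames, q + 3)))
    sparse_pairs += len(allowed) * tokens_per_frame ** 2

ratio = sparse_pairs / full_pairs  # fraction of full-attention work kept
```

Under these assumptions the sparse pattern computes roughly a third of the full-attention score pairs, which shows how block sparsity converts the quadratic cost in total sequence length into a cost linear in the number of allowed frame blocks.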