Memory-Efficient Fine-Tuning Diffusion Transformers via Dynamic Patch Sampling and Block Skipping

arXiv cs.CV / 3/24/2026


Key Points

  • The paper proposes DiT-BlockSkip, a memory-efficient fine-tuning method for Diffusion Transformers aimed at reducing compute and memory barriers for text-to-image personalization.
  • It introduces timestep-aware dynamic patch sampling, varying patch sizes across diffusion timesteps and resizing cropped patches to a fixed lower resolution to better balance global vs. fine-grained detail learning.
  • It adds a block-skipping fine-tuning mechanism that selectively updates only essential transformer blocks and precomputes residual features for skipped blocks to cut training memory further.
  • A cross-attention-masking-based block selection strategy is used to identify which blocks are most vital for personalization.
  • Experiments indicate competitive personalization quality at substantially lower memory usage, supporting more feasible on-device deployment of large diffusion models.
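The timestep-aware sampling described above can be sketched roughly as follows. This is an illustrative reconstruction, not the paper's implementation: the function name, the linear patch-size schedule, and the nearest-neighbor resize are all assumptions; the core idea is simply that noisier (higher) timesteps get larger crops for global structure, lower timesteps get smaller crops for detail, and every crop is resized to one fixed lower resolution so the training memory footprint stays constant.

```python
import numpy as np

def dynamic_patch_sample(latent, t, t_max, out_size=32, rng=None):
    """Timestep-aware patch sampling (hypothetical sketch).

    latent: array of shape (C, H, W); t: current diffusion timestep.
    High t -> large crop (global structure); low t -> small crop (detail).
    The crop is resized to (out_size, out_size) so every step sees the
    same tensor shape regardless of t.
    """
    if rng is None:
        rng = np.random.default_rng()
    c, h, w = latent.shape
    # Assumed schedule: patch size interpolates linearly between
    # out_size at t=0 and the full spatial extent at t=t_max.
    frac = t / t_max
    patch = int(round(out_size + frac * (min(h, w) - out_size)))
    y = rng.integers(0, h - patch + 1)
    x = rng.integers(0, w - patch + 1)
    crop = latent[:, y:y + patch, x:x + patch]
    # Nearest-neighbor resize to the fixed target resolution.
    idx = (np.arange(out_size) * patch / out_size).astype(int)
    return crop[:, idx[:, None], idx[None, :]]
```

Because the output shape is fixed, activation memory in both the forward and backward pass is bounded by `out_size`, independent of the original latent resolution.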

Abstract

Diffusion Transformers (DiTs) have significantly enhanced text-to-image (T2I) generation quality, enabling high-quality personalized content creation. However, fine-tuning these models incurs substantial computational and memory costs, limiting practical deployment under resource constraints. To tackle these challenges, we propose a memory-efficient fine-tuning framework called DiT-BlockSkip, which integrates timestep-aware dynamic patch sampling with block skipping via precomputed residual features. Our dynamic patch sampling strategy adjusts patch sizes based on the diffusion timestep, then resizes the cropped patches to a fixed lower resolution. This reduces forward- and backward-pass memory usage while allowing the model to capture global structures at higher timesteps and fine-grained details at lower timesteps. The block skipping mechanism selectively fine-tunes essential transformer blocks and precomputes residual features for the skipped blocks, significantly reducing training memory. To identify the blocks vital for personalization, we introduce a block selection strategy based on cross-attention masking. Evaluations demonstrate that our approach achieves competitive personalization performance qualitatively and quantitatively while reducing memory usage substantially, moving toward on-device feasibility (e.g., smartphones, IoT devices) for large-scale diffusion transformers.
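The block-skipping mechanism can be sketched as below. Again this is a minimal illustration under assumptions: the `Block` class, function names, and the cache keyed by block index are invented for exposition. The idea is that each transformer block has residual form y = x + f(x); for blocks excluded from fine-tuning, f(x) is computed once in an initial frozen pass and cached, so later training steps add the cached residual instead of executing the block, saving both compute and activation memory. (Reusing a cached residual is exact only when the block's input is unchanged; in training it is an approximation.)

```python
import numpy as np

class Block:
    """Stand-in for a transformer block in residual form: y = x + f(x)."""
    def __init__(self, dim, rng):
        self.w = rng.standard_normal((dim, dim)) * 0.01

    def residual(self, x):
        return np.tanh(x @ self.w)

    def __call__(self, x):
        return x + self.residual(x)

def precompute_residual_cache(blocks, x, trainable_ids):
    """One frozen forward pass; cache residuals of blocks we will skip."""
    cache = {}
    for i, blk in enumerate(blocks):
        r = blk.residual(x)
        if i not in trainable_ids:
            cache[i] = r
        x = x + r
    return cache

def forward_with_skipping(blocks, x, trainable_ids, cache):
    """Execute only the selected (trainable) blocks; skipped blocks are
    replaced by their precomputed residual features."""
    for i, blk in enumerate(blocks):
        if i in trainable_ids:
            x = blk(x)          # executed; would receive gradients
        else:
            x = x + cache[i]    # skipped; reuse cached residual
    return x
```

Which indices go into `trainable_ids` is exactly what the paper's cross-attention-masking-based selection strategy decides: blocks whose cross-attention responds most to the personalization target are kept trainable, the rest are skipped.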