ScrollScape: Unlocking 32K Image Generation With Video Diffusion Priors

arXiv cs.CV / 3/26/2026


Key Points

  • ScrollScape addresses a key failure mode of diffusion-based ultra-high-resolution, extreme-aspect-ratio (EAR) image synthesis—catastrophic structural breakdowns like object repetition and spatial fragmentation—by attributing it to insufficient spatial priors in conventional text-to-image training.
  • The framework converts EAR image generation into a continuous video-generation problem, using video temporal consistency as a global constraint to preserve long-range structure.
  • It introduces ScanPE to distribute global coordinates across video frames via a “moving camera” mechanism, and ScrollSR to apply video super-resolution priors to reach unprecedented 32K resolution while avoiding memory bottlenecks.
  • ScrollScape is fine-tuned on a curated 3K multi-ratio dataset and demonstrates strong performance over existing image-diffusion baselines, notably reducing severe localized artifacts and improving global coherence and visual fidelity across domains.
  • Overall, the work suggests a general strategy for extreme-scale image generation: reparameterize spatially challenging image tasks using temporally grounded video priors for more robust global structure.
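The "moving camera" idea behind ScanPE can be sketched in a few lines: each video frame is assigned a window of *global* canvas coordinates, so that positional encodings stay continuous as the virtual camera scrolls across the wide canvas. The function name, windowing scheme, and parameters below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def scanning_positions(canvas_width, frame_width, num_frames):
    """Assign each frame a window of global canvas x-coordinates,
    emulating a camera scrolling left-to-right across a wide canvas.
    Hypothetical ScanPE-style sketch; all names are illustrative."""
    # Choose the stride so the last frame's right edge lands exactly
    # on the canvas edge; consecutive frames overlap when stride < width.
    stride = (canvas_width - frame_width) / max(num_frames - 1, 1)
    offsets = [round(i * stride) for i in range(num_frames)]
    # Each frame sees global coordinates [offset, offset + frame_width).
    return [np.arange(off, off + frame_width) for off in offsets]

coords = scanning_positions(canvas_width=1024, frame_width=256, num_frames=5)
# Frame 0 covers x in [0, 256); frame 4 covers x in [768, 1024).
```

Because every frame carries globally consistent coordinates rather than frame-local ones, the video model's temporal consistency can act as the long-range spatial constraint the key points describe.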

Abstract

While diffusion models excel at generating images with conventional dimensions, pushing them to synthesize ultra-high-resolution imagery at extreme aspect ratios (EAR) often triggers catastrophic structural failures, such as object repetition and spatial fragmentation. This limitation fundamentally stems from a lack of robust spatial priors, as static text-to-image models are primarily trained on image distributions with conventional dimensions. To overcome this bottleneck, we present ScrollScape, a novel framework that reformulates EAR image synthesis into a continuous video generation process through two core innovations. By mapping the spatial expansion of a massive canvas to the temporal evolution of video frames, ScrollScape leverages the inherent temporal consistency of video models as a powerful global constraint to ensure long-range structural integrity. Specifically, Scanning Positional Encoding (ScanPE) distributes global coordinates across frames to act as a flexible moving camera, while Scrolling Super-Resolution (ScrollSR) leverages video super-resolution priors to circumvent memory bottlenecks, efficiently scaling outputs to an unprecedented 32K resolution. Fine-tuned on a curated 3K multi-ratio image dataset, ScrollScape effectively aligns pre-trained video priors with the EAR generation task. Extensive evaluations demonstrate that it significantly outperforms existing image-diffusion baselines by eliminating severe localized artifacts. Consequently, our method overcomes inherent structural bottlenecks to ensure exceptional global coherence and visual fidelity across diverse domains at extreme scales.
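The memory argument behind ScrollSR can be illustrated with a generic tile-wise ("scrolling") super-resolution loop: the wide canvas is upsampled window by window, with overlapping regions blended linearly, so the model only ever processes one window at a time. This is a sketch of the general technique under assumed names and parameters; the stand-in `upsampler` below is a simple nearest-neighbour upscaler, not the paper's video-SR model.

```python
import numpy as np

def scroll_superresolve(lowres, window, overlap, scale, upsampler):
    """Upsample a wide strip window by window, feathering each tile's
    left edge so overlaps blend smoothly. Peak memory is bounded by a
    single window, never by the full high-resolution canvas."""
    h, w = lowres.shape
    out = np.zeros((h * scale, w * scale), dtype=np.float32)
    weight = np.zeros_like(out)
    step = window - overlap
    x = 0
    while x < w:
        x0, x1 = x, min(x + window, w)
        tile = upsampler(lowres[:, x0:x1])   # shape (h*scale, (x1-x0)*scale)
        tw = tile.shape[1]
        ramp = np.ones(tw, dtype=np.float32)
        ov = overlap * scale
        if x0 > 0:                           # feather into the previous tile
            ramp[:ov] = np.linspace(0.0, 1.0, ov, endpoint=False)
        out[:, x0 * scale:x0 * scale + tw] += tile * ramp
        weight[:, x0 * scale:x0 * scale + tw] += ramp
        if x1 == w:
            break
        x += step
    return out / np.maximum(weight, 1e-8)    # normalised weighted average

# Nearest-neighbour "upsampler" standing in for a learned SR model.
nn_up = lambda t, s=4: np.repeat(np.repeat(t, s, axis=0), s, axis=1)
strip = np.random.rand(8, 64).astype(np.float32)
hires = scroll_superresolve(strip, window=16, overlap=4, scale=4,
                            upsampler=lambda t: nn_up(t))
```

With this deterministic stand-in upsampler the stitched result exactly matches upsampling the whole strip at once; with a learned model, the overlap blending is what hides the seams between windows while keeping memory flat as the canvas grows toward 32K.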