Colorful-Noise: Training-Free Low-Frequency Noise Manipulation for Color-Based Conditional Image Generation

arXiv cs.CV / 5/4/2026


Key Points

  • The paper analyzes how different frequency components of the input noise in text-to-image diffusion models affect global structure, color composition, and fine details.
  • It argues that while white Gaussian noise provides diversity, its lack of human-interpretable structure limits controllability and predictability of visual attributes.
  • The authors show that low-frequency noise is mainly responsible for global structure and color, whereas high-frequency noise drives finer details.
  • They propose a training-free technique that manipulates low-frequency noise using low-frequency image priors to steer the generation process with minimal overhead.
  • By constraining global/color cues through low-frequency manipulation while allowing high-frequency components to emerge naturally, the method improves conditional generation without reducing output diversity.
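The key points above describe swapping the low-frequency band of the initial noise for low-frequency cues from a reference image while leaving the high-frequency band untouched. A minimal sketch of that idea using a 2D FFT is below; the function name, cutoff, and blending weight are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def lowfreq_swap(noise, image, cutoff=0.1, strength=0.8):
    """Blend a reference image's low frequencies into Gaussian noise.

    Hypothetical sketch of low-frequency noise manipulation: frequencies
    below `cutoff` (in normalized units) are mixed toward the image's
    spectrum; high frequencies keep the original noise, so fine details
    can still emerge freely during sampling.
    """
    # Per-channel 2D FFT of the noise and the image prior
    F_noise = np.fft.fft2(noise, axes=(0, 1))
    F_image = np.fft.fft2(image, axes=(0, 1))

    # Radial low-frequency mask in normalized frequency coordinates
    h, w = noise.shape[:2]
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    low = np.sqrt(fy**2 + fx**2) < cutoff
    if noise.ndim == 3:
        low = low[..., None]  # broadcast over channels

    # Replace only the low band; high band stays pure noise
    F_mix = np.where(low, (1 - strength) * F_noise + strength * F_image, F_noise)
    return np.real(np.fft.ifft2(F_mix, axes=(0, 1)))

# Example: condition a 32x32x3 latent-sized noise tensor on a reference image
rng = np.random.default_rng(0)
noise = rng.standard_normal((32, 32, 3))
image = rng.standard_normal((32, 32, 3))
mixed = lowfreq_swap(noise, image)
```

In practice one would likely renormalize the mixed result so its statistics stay close to the white Gaussian noise the diffusion model expects; the training-free appeal of the approach is that only this pre-sampling step changes, not the model or sampler.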

Abstract

Text-to-image diffusion models generate images by gradually converting white Gaussian noise into a natural image. White Gaussian noise is well suited for producing diverse outputs from a single text prompt due to its absence of structure. However, this very property limits control over, and predictability of, specific visual attributes, as the noise is not human-interpretable. In this work, we investigate the characteristics of the input noise in diffusion models. We show that, although all frequencies in white Gaussian noise have comparable statistical energy, low-frequency components primarily determine the image's global structure and color composition, while high-frequency components control finer details. Building on this observation, we demonstrate that simple manipulations of the low-frequency noise using low-frequency image priors can effectively condition the generation process to reconstruct these low-frequency visual cues. This allows us to define a simple, training-free method with minimal overhead that steers overall image structure and color, while letting high-frequency components freely emerge as fine details, enabling variability across generated outputs.