Frequency-Aware Flow Matching for High-Quality Image Generation

arXiv cs.CV / 4/20/2026


Key Points

  • Flow matching models generate realistic images by learning to reverse a corruption process that progressively adds Gaussian noise. Because that noise affects the latent's frequency components unevenly, high-frequency detail emerges only late in the reverse (inference) process.
  • The paper proposes Frequency-Aware Flow Matching (FreqFlow), which incorporates frequency-aware conditioning into the flow matching framework through time-dependent adaptive weighting, so that both low-frequency structure and high-frequency detail are modeled effectively throughout sampling.
  • FreqFlow uses a two-branch design: a frequency branch that separately models low- and high-frequency components, and a spatial latent-domain branch that synthesizes images guided by the frequency branch.
  • On ImageNet-256 class-conditional generation, FreqFlow achieves state-of-the-art results with an FID of 1.38, improving over prior diffusion (DiT) and flow-matching (SiT) approaches by 0.79 and 0.58 FID, respectively.
  • The authors release code via GitHub, enabling replication and further experimentation with the proposed method.
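The uneven effect of noise on frequency bands described in the first point can be illustrated with a toy calculation. This is an assumption-laden sketch, not FreqFlow code: it uses a synthetic stand-in for one latent channel with a 1/f amplitude spectrum (a common proxy for natural-image statistics) and assumes a linear path x_t = (1 - t)·x0 + t·ε, with t = 0 at the data and t = 1 at pure noise. For each frequency band it computes the time at which signal power equals the power of white Gaussian noise. Since sampling runs from t = 1 down to t = 0, a band with a larger crossover time becomes recoverable earlier. The helper names and band edges are hypothetical.

```python
import numpy as np


def radial_frequency(size: int) -> np.ndarray:
    """Radial spatial frequency (cycles/pixel) of each FFT bin on a size x size grid."""
    f = np.fft.fftfreq(size)
    return np.sqrt(f[:, None] ** 2 + f[None, :] ** 2)


def synthetic_latent(size: int = 64, seed: int = 0) -> np.ndarray:
    """Hypothetical stand-in for one latent channel: white noise shaped to a
    1/f amplitude spectrum (assumed natural-image statistics), unit variance."""
    rng = np.random.default_rng(seed)
    white = rng.standard_normal((size, size))
    f = radial_frequency(size)
    shaping = np.zeros_like(f)
    shaping[f > 0] = 1.0 / f[f > 0]  # 1/f amplitude; DC bin removed (zero mean)
    x = np.real(np.fft.ifft2(np.fft.fft2(white) * shaping))
    return x / x.std()


def band_power(x: np.ndarray, lo: float, hi: float) -> float:
    """Mean spectral power of x over FFT bins with lo < |f| <= hi."""
    f = radial_frequency(x.shape[0])
    mask = (f > lo) & (f <= hi)
    return float(np.mean(np.abs(np.fft.fft2(x))[mask] ** 2))


def crossover_time(signal_power: float, noise_power: float) -> float:
    """Solve (1 - t)^2 * signal_power == t^2 * noise_power for t under the
    assumed linear path x_t = (1 - t) * x0 + t * eps."""
    s, n = np.sqrt(signal_power), np.sqrt(noise_power)
    return float(s / (s + n))


size = 64
x0 = synthetic_latent(size)
noise_power = float(size * size)  # expected |FFT(eps)|^2 per bin for unit-variance white noise
low = band_power(x0, 0.0, 0.1)    # low band: 0 < |f| <= 0.1 cycles/pixel
high = band_power(x0, 0.3, 0.5)   # high band: 0.3 < |f| <= 0.5 cycles/pixel
t_low = crossover_time(low, noise_power)
t_high = crossover_time(high, noise_power)
print(f"low band above noise until t={t_low:.2f}; high band only until t={t_high:.2f}")
```

Under these assumptions the low band stays above the noise floor until a much larger t than the high band, which is consistent with the qualitative ordering the authors describe (global structure first, fine detail last). The exact numbers depend on the arbitrary band edges and the 1/f assumption.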

Abstract

Flow matching models have emerged as a powerful framework for realistic image generation by learning to reverse a corruption process that progressively adds Gaussian noise. However, because noise is injected in the latent domain, its impact on different frequency components is non-uniform. As a result, during inference, flow matching models tend to generate low-frequency components (global structure) in the early stages, while high-frequency components (fine details) emerge only later in the reverse process. Building on this insight, we propose Frequency-Aware Flow Matching (FreqFlow), a novel approach that explicitly incorporates frequency-aware conditioning into the flow matching framework via time-dependent adaptive weighting. We introduce a two-branch architecture: (1) a frequency branch that separately processes low- and high-frequency components to capture global structure and refine textures and edges, and (2) a spatial branch that synthesizes images in the latent domain, guided by the frequency branch's output. By explicitly integrating frequency information into the generation process, FreqFlow ensures that both large-scale coherence and fine-grained details are effectively modeled: low-frequency conditioning reinforces global structure, while high-frequency conditioning enhances texture fidelity and detail sharpness. On the class-conditional ImageNet-256 generation benchmark, our method achieves state-of-the-art performance with an FID of 1.38, surpassing the prior diffusion model DiT and flow matching model SiT by 0.79 and 0.58 FID, respectively. Code is available at https://github.com/OliverRensu/FreqFlow.
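The abstract does not specify the form of FreqFlow's time-dependent adaptive weighting, so what follows is only a hypothetical sketch of the general idea, not the authors' method. It assumes a linear flow-matching path x_t = (1 - t)·x0 + t·ε (velocity target ε - x0, t = 1 at noise). It splits the velocity error into low- and high-frequency parts with an ideal FFT low-pass mask and weights each part with a made-up schedule, `hypothetical_weights`, that stresses structure near the noise end and detail near the data end. The weighting is applied to the training loss here purely for illustration; the paper describes it as part of frequency-aware conditioning. The 4×32×32 latent shape, the cutoff of 0.125 cycles/pixel, and all function names are assumptions.

```python
import numpy as np


def split_bands(x: np.ndarray, cutoff: float) -> tuple[np.ndarray, np.ndarray]:
    """Split a (C, H, W) array into low/high-frequency parts using an ideal
    radial low-pass mask over the last two axes (an assumed decomposition)."""
    h, w = x.shape[-2:]
    f = np.sqrt(np.fft.fftfreq(h)[:, None] ** 2 + np.fft.fftfreq(w)[None, :] ** 2)
    low_mask = (f <= cutoff).astype(x.dtype)  # symmetric mask, so the inverse FFT is real
    low = np.real(np.fft.ifft2(np.fft.fft2(x) * low_mask))
    return low, x - low


def hypothetical_weights(t: float) -> tuple[float, float]:
    """Made-up schedule (t = 1 noise, t = 0 data): weight low-frequency error
    more near the noise end and high-frequency error more near the data end."""
    return t, 1.0 - t


def freq_weighted_fm_loss(x0, eps, t, v_pred, cutoff=0.125):
    """Linear-path flow-matching loss with band-wise, time-dependent weights."""
    v_target = eps - x0  # d/dt of x_t = (1 - t) * x0 + t * eps
    err_low, err_high = split_bands(v_pred - v_target, cutoff)
    w_low, w_high = hypothetical_weights(t)
    return float(w_low * np.mean(err_low**2) + w_high * np.mean(err_high**2))


rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 32, 32))   # stand-in for a clean latent
eps = rng.standard_normal((4, 32, 32))  # Gaussian noise sample
t = 0.3
x_t = (1 - t) * x0 + t * eps  # the input a velocity network v(x_t, t) would receive

perfect = freq_weighted_fm_loss(x0, eps, t, v_pred=eps - x0)        # exact velocity
zero = freq_weighted_fm_loss(x0, eps, t, v_pred=np.zeros_like(x0))  # trivial predictor
print(perfect, zero)
```

Because the two masks have disjoint spectral support, the low and high parts are orthogonal and sum back to the original error, so with equal weights this reduces to the ordinary mean-squared velocity loss. Only the time-dependent weights change how errors in each band are emphasized.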