LayerCache: Exploiting Layer-wise Velocity Heterogeneity for Efficient Flow Matching Inference

arXiv cs.CV / 4/21/2026


Key Points

  • The paper argues that Flow Matching image generation models incur high inference cost because each denoising step requires a full forward pass through a large Transformer network.
  • It finds that Transformer layer groups have highly heterogeneous velocity dynamics, with shallow layers being stable enough for caching while deeper layers require full computation.
  • It introduces LayerCache, a layer-aware caching framework that partitions the Transformer into layer groups and makes independent caching decisions per group at each denoising step.
  • LayerCache includes an adaptive mechanism for selecting the JVP (Jacobian-vector product) span K, and casts scheduling over timesteps, layer groups, and K as a budgeted optimization solved with a greedy allocation algorithm.
  • Experiments on Qwen-Image (1024×1024, 50 steps) show strong quality-speed gains versus MeanCache and prior caching methods, including +5.38 dB PSNR and a 70% LPIPS reduction with 1.37× speedup.
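The core idea in the second and third bullets, caching each layer group independently across denoising steps, can be illustrated with a minimal sketch. All names here (`forward_with_cache`, the residual-reuse scheme, the recompute schedule) are illustrative assumptions, not the paper's actual code:

```python
def forward_with_cache(groups, h, step, cache, recompute_schedule):
    """Run the Transformer as a sequence of layer groups, recomputing a
    group only at steps listed in its schedule and reusing its cached
    residual at all other steps (hypothetical sketch, not the paper's code)."""
    for g, group_fn in enumerate(groups):
        if step in recompute_schedule[g] or g not in cache:
            cache[g] = group_fn(h)  # full computation; refresh this group's cache
        h = h + cache[g]            # stable (shallow) groups mostly hit the cache
    return h

# Toy demo: two "layer groups" acting on a scalar state.
groups = [lambda x: 0.1 * x, lambda x: 0.5 * x]
# The deep group (index 1) recomputes at every step; the shallow group
# (index 0) is treated as stable and recomputes only at steps 0 and 5.
schedule = {0: {0, 5}, 1: set(range(10))}
cache = {}
h = 1.0
for step in range(10):
    h = forward_with_cache(groups, h, step, cache, schedule)
```

The point of the sketch is that the caching decision lives inside the per-group loop, so each group can follow its own schedule, unlike monolithic methods that cache the whole network at once.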

Abstract

Flow Matching models achieve state-of-the-art image generation quality but incur substantial inference cost due to iterative denoising through large Transformer networks. We observe that different layer groups within a Transformer exhibit markedly heterogeneous velocity dynamics: shallow layers are highly stable and amenable to aggressive caching, while deep layers undergo large velocity changes that demand full computation. Existing caching methods, however, treat the entire Transformer as a monolithic unit, applying a single caching decision per timestep and thus failing to exploit this heterogeneity. Based on this finding, we propose LayerCache, a layer-aware caching framework that partitions the Transformer into layer groups and makes independent, per-group caching decisions at each denoising step. LayerCache introduces an adaptive JVP span K selection mechanism that leverages per-group stability measurements to balance estimation accuracy and computational savings. We formulate a three-dimensional scheduling problem over timesteps, layer groups, and JVP span, and solve it with a greedy budget allocation algorithm. On Qwen-Image (1024×1024, 50 steps), LayerCache achieves PSNR 37.46 dB (+5.38 dB over MeanCache), SSIM 0.9834, and LPIPS 0.0178 (a 70% reduction over MeanCache) at 1.37× speedup, dominating all prior caching methods on the quality-speed Pareto frontier.
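The greedy budget allocation mentioned in the abstract can be sketched in a few lines. This is a hypothetical reconstruction under simple assumptions: each candidate is a (timestep, layer group, K) option with an estimated compute saving and an estimated approximation error, and the greedy rule applies options in order of best saving-to-error ratio until a target saving is reached. None of these names or the exact objective come from the paper:

```python
def greedy_allocate(candidates, budget):
    """Greedy budget allocation (hypothetical sketch). Starting from full
    computation everywhere, repeatedly apply the caching option with the
    best compute-saving per unit of estimated error until the total saving
    meets the budget."""
    chosen, saved = [], 0.0
    for cand in sorted(candidates, key=lambda c: c["error"] / c["saving"]):
        if saved >= budget:
            break
        chosen.append((cand["timestep"], cand["group"], cand["K"]))
        saved += cand["saving"]
    return chosen, saved

# Toy candidate pool with made-up savings/errors.
candidates = [
    {"timestep": 10, "group": 0, "K": 3, "saving": 4.0, "error": 0.1},
    {"timestep": 20, "group": 1, "K": 2, "saving": 2.0, "error": 0.5},
    {"timestep": 30, "group": 0, "K": 5, "saving": 3.0, "error": 0.2},
]
chosen, saved = greedy_allocate(candidates, budget=6.0)
```

In this toy run the two low-error, high-saving options are picked first and the high-error one is skipped once the budget is met, which is the quality-speed trade-off the Pareto-frontier claim refers to.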