LayerCache: Exploiting Layer-wise Velocity Heterogeneity for Efficient Flow Matching Inference

arXiv cs.CV / 4/21/2026


Key Points

  • The paper argues that Flow Matching image generation models incur high inference cost because each denoising step requires a full forward pass through a large Transformer network.
  • It finds that Transformer layer groups have highly heterogeneous velocity dynamics, with shallow layers being stable enough for caching while deeper layers require full computation.
  • It introduces LayerCache, a layer-aware caching framework that partitions the Transformer into layer groups and makes independent caching decisions per group at each denoising step.
  • LayerCache includes an adaptive mechanism for selecting the JVP (Jacobian-vector product) span K, and casts scheduling over timesteps, layer groups, and K as a budgeted optimization solved with a greedy allocation algorithm.
  • Experiments on Qwen-Image (1024×1024, 50 steps) show strong quality-speed gains versus MeanCache and prior caching methods, including +5.38 dB PSNR and a 70% LPIPS reduction with 1.37× speedup.
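The core idea in the second and third bullets, caching each layer group independently across denoising steps, can be illustrated with a minimal sketch. All names here (`forward_with_cache`, the residual-reuse scheme, the recompute schedule) are illustrative assumptions, not the paper's actual code:

```python
def forward_with_cache(groups, h, step, cache, recompute_schedule):
    """Run the Transformer as a sequence of layer groups, recomputing a
    group only at steps listed in its schedule and reusing its cached
    residual at all other steps (hypothetical sketch, not the paper's code)."""
    for g, group_fn in enumerate(groups):
        if step in recompute_schedule[g] or g not in cache:
            cache[g] = group_fn(h)  # full computation; refresh this group's cache
        h = h + cache[g]            # stable (shallow) groups mostly hit the cache
    return h

# Toy demo: two "layer groups" acting on a scalar state.
groups = [lambda x: 0.1 * x, lambda x: 0.5 * x]
# The deep group (index 1) recomputes at every step; the shallow group
# (index 0) is treated as stable and recomputes only at steps 0 and 5.
schedule = {0: {0, 5}, 1: set(range(10))}
cache = {}
h = 1.0
for step in range(10):
    h = forward_with_cache(groups, h, step, cache, schedule)
```

The point of the sketch is that the caching decision lives inside the per-group loop, so each group can follow its own schedule, unlike monolithic methods that cache the whole network at once.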

Abstract

Flow Matching models achieve state-of-the-art image generation quality but incur substantial inference cost due to iterative denoising through large Transformer networks. We observe that different layer groups within a Transformer exhibit markedly heterogeneous velocity dynamics: shallow layers are highly stable and amenable to aggressive caching, while deep layers undergo large velocity changes that demand full computation. Existing caching methods, however, treat the entire Transformer as a monolithic unit, applying a single caching decision per timestep and thus failing to exploit this heterogeneity. Based on this finding, we propose LayerCache, a layer-aware caching framework that partitions the Transformer into layer groups and makes independent, per-group caching decisions at each denoising step. LayerCache introduces an adaptive JVP span K selection mechanism that leverages per-group stability measurements to balance estimation accuracy and computational savings. We formulate a three-dimensional scheduling problem over timesteps, layer groups, and JVP span, and solve it with a greedy budget allocation algorithm. On Qwen-Image (1024×1024, 50 steps), LayerCache achieves PSNR 37.46 dB (+5.38 dB over MeanCache), SSIM 0.9834, and LPIPS 0.0178 (a 70% reduction over MeanCache) at 1.37× speedup, dominating all prior caching methods on the quality-speed Pareto frontier.
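The greedy budget allocation mentioned in the abstract can be sketched in a few lines. This is a hypothetical reconstruction under simple assumptions: each candidate is a (timestep, layer group, K) option with an estimated compute saving and an estimated approximation error, and the greedy rule applies options in order of best saving-to-error ratio until a target saving is reached. None of these names or the exact objective come from the paper:

```python
def greedy_allocate(candidates, budget):
    """Greedy budget allocation (hypothetical sketch). Starting from full
    computation everywhere, repeatedly apply the caching option with the
    best compute-saving per unit of estimated error until the total saving
    meets the budget."""
    chosen, saved = [], 0.0
    for cand in sorted(candidates, key=lambda c: c["error"] / c["saving"]):
        if saved >= budget:
            break
        chosen.append((cand["timestep"], cand["group"], cand["K"]))
        saved += cand["saving"]
    return chosen, saved

# Toy candidate pool with made-up savings/errors.
candidates = [
    {"timestep": 10, "group": 0, "K": 3, "saving": 4.0, "error": 0.1},
    {"timestep": 20, "group": 1, "K": 2, "saving": 2.0, "error": 0.5},
    {"timestep": 30, "group": 0, "K": 5, "saving": 3.0, "error": 0.2},
]
chosen, saved = greedy_allocate(candidates, budget=6.0)
```

In this toy run the two low-error, high-saving options are picked first and the high-error one is skipped once the budget is met, which is the quality-speed trade-off the Pareto-frontier claim refers to.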