Towards Faster Language Model Inference Using Mixture-of-Experts Flow Matching

arXiv cs.AI / 4/17/2026


Key Points

  • The paper proposes Mixture-of-Experts Flow Matching (MoE-FM) to overcome flow-matching limitations in language modeling, especially for latent distributions with anisotropy and multimodality.
  • It introduces a non-autoregressive (NAR) language modeling system called YAN, built on MoE-FM and instantiated using both Transformer and Mamba architectures.
  • Across multiple downstream tasks, YAN matches the generation quality of autoregressive and diffusion-based NAR language models while using as few as three sampling steps.
  • The approach reportedly achieves up to ~40× speedup over autoregressive baselines and up to ~10³× speedup over diffusion-based language models, highlighting major inference-efficiency benefits.
  • Overall, the work positions MoE-FM + NAR decoding as a practical route to faster generative inference without sacrificing quality.

Abstract

Flow matching retains the generation quality of diffusion models while enabling substantially faster inference, making it a compelling paradigm for generative modeling. However, when applied to language modeling, it exhibits fundamental limitations in representing complex latent distributions with irregular geometries, such as anisotropy and multimodality. To address these challenges, we propose a mixture-of-experts flow matching (MoE-FM) framework, which captures complex global transport geometries in latent space by decomposing them into locally specialized vector fields. Building on MoE-FM, we develop a non-autoregressive (NAR) language modeling approach, named YAN, instantiated with both Transformer and Mamba architectures. Across multiple downstream tasks, YAN achieves generation quality on par with both autoregressive (AR) and diffusion-based NAR language models, while requiring as few as three sampling steps. This yields a 40× speedup over AR baselines and up to a 10³× speedup over diffusion language models, demonstrating substantial efficiency advantages for language modeling.
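To make the two central ideas concrete — a gate-weighted mixture of expert vector fields, and few-step sampling by integrating the resulting flow ODE — here is a minimal NumPy sketch. Everything in it is illustrative, not the paper's actual method: the linear expert fields, the softmax gating on the current state, the dimensions `D` and `K`, and the plain Euler integrator are all assumptions chosen only to show the shape of the computation.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 8, 4  # illustrative latent dimension and number of experts

# Each "expert" is a toy linear vector field v_k(x, t) = A_k @ x + t * b_k.
# (A real model would use neural networks here.)
A = rng.normal(scale=0.1, size=(K, D, D))
b = rng.normal(scale=0.1, size=(K, D))

# Toy gating: softmax over per-expert scores of the current state x.
W_gate = rng.normal(scale=0.1, size=(K, D))

def moe_vector_field(x, t):
    """Mixture velocity: gate-weighted sum of locally specialized fields."""
    logits = W_gate @ x                  # (K,) per-expert scores
    gates = np.exp(logits - logits.max())
    gates /= gates.sum()                 # softmax mixture weights
    expert_out = A @ x + t * b           # (K, D) expert velocities
    return gates @ expert_out            # (D,) mixed velocity

def sample(x0, n_steps=3):
    """Few-step Euler integration of dx/dt = v(x, t) from t=0 to t=1."""
    x, dt = x0, 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * moe_vector_field(x, i * dt)
    return x

x0 = rng.normal(size=D)     # draw from the simple source distribution
x1 = sample(x0, n_steps=3)  # three sampling steps, mirroring the claim above
```

The efficiency argument in the abstract maps onto the loop in `sample`: each sampling step costs one forward pass, so three Euler steps replace the hundreds of denoising steps a diffusion sampler would typically take, while the mixture lets different experts handle different regions of an anisotropic or multimodal latent space.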