WAND: Windowed Attention and Knowledge Distillation for Efficient Autoregressive Text-to-Speech Models

arXiv cs.AI / 4/13/2026


Key Points

  • The paper introduces WAND (Windowed Attention and Knowledge Distillation) to make decoder-only autoregressive text-to-speech (AR-TTS) models run with constant memory/compute as sequence length grows.
  • WAND modifies the attention scheme by using persistent global attention over conditioning tokens and sliding-window (local) attention over generated tokens to avoid quadratic scaling.
  • It stabilizes fine-tuning with a curriculum learning strategy that progressively narrows the attention window during training.
  • The method uses knowledge distillation from a full-attention teacher to retain high-fidelity speech quality while improving data efficiency.
  • Experiments on three modern AR-TTS models show quality preservation alongside up to 66.2% KV-cache memory reduction and near-constant per-step latency.
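The hybrid attention scheme in the second bullet can be pictured as a mask in which every query attends to the conditioning prefix, while generated tokens additionally attend only to a recent causal window. The sketch below is an illustrative reconstruction of that mask, not the paper's code; the function name, the split into `num_cond`/`num_gen`, and the inclusive window size are assumptions.

```python
import numpy as np

def wand_style_attention_mask(num_cond: int, num_gen: int, window: int) -> np.ndarray:
    """Boolean attention mask (True = may attend); rows are queries, columns keys.

    The first `num_cond` positions are conditioning tokens (text/prompt) that
    every later query attends to globally; the remaining `num_gen` positions
    are generated tokens restricted to a causal sliding window of size `window`.
    """
    total = num_cond + num_gen
    mask = np.zeros((total, total), dtype=bool)
    # Persistent global attention: every query sees the conditioning prefix
    # (causally, while still inside the prefix itself).
    for q in range(total):
        mask[q, : min(q + 1, num_cond)] = True
    # Local sliding-window attention over generated tokens.
    for q in range(num_cond, total):
        lo = max(num_cond, q - window + 1)
        mask[q, lo : q + 1] = True
    return mask

mask = wand_style_attention_mask(num_cond=4, num_gen=8, window=3)
# Each generated query reads at most num_cond + window keys, so per-step
# attention cost stays constant no matter how long generation runs.
```

Because the number of attended keys per generated step is bounded by `num_cond + window`, both per-step compute and the KV entries that must be kept live stop growing with sequence length, which is the source of the near-constant latency reported in the last bullet.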

Abstract

Recent decoder-only autoregressive text-to-speech (AR-TTS) models produce high-fidelity speech, but their memory and compute costs scale quadratically with sequence length due to full self-attention. In this paper, we propose WAND, Windowed Attention and Knowledge Distillation, a framework that adapts pretrained AR-TTS models to operate with constant computational and memory complexity. WAND splits the attention mechanism into two components: persistent global attention over conditioning tokens and local sliding-window attention over generated tokens. To stabilize fine-tuning, we employ a curriculum learning strategy that progressively tightens the attention window. We further utilize knowledge distillation from a full-attention teacher to recover high-fidelity synthesis quality with high data efficiency. Evaluated on three modern AR-TTS models, WAND preserves the original quality while achieving up to 66.2% KV-cache memory reduction and length-invariant, near-constant per-step latency.