WAND: Windowed Attention and Knowledge Distillation for Efficient Autoregressive Text-to-Speech Models
arXiv cs.AI / 4/13/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The paper introduces WAND (Windowed Attention and Knowledge Distillation) to make decoder-only autoregressive text-to-speech (AR-TTS) models run with constant per-step memory and compute as the sequence length grows.
- WAND modifies the attention scheme to use persistent global attention over conditioning tokens and sliding-window (local) attention over generated tokens, avoiding quadratic scaling; a minimal mask sketch follows this list.
- It stabilizes fine-tuning with a curriculum that progressively narrows the attention window during training.
- The method uses knowledge distillation from a full-attention teacher to retain high-fidelity speech quality while improving data efficiency; a sketch of the schedule and loss also appears below.
- Experiments on three modern AR-TTS models show quality is preserved alongside up to 66.2% KV-cache memory reduction and near-constant per-step latency.
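To make the attention pattern concrete, here is a minimal sketch of a mask that combines persistent global attention over a conditioning prefix with sliding-window attention over generated tokens. The function name, the `num_cond` and `window` parameters, and the exact masking of the prefix are illustrative assumptions, not the paper's reported implementation.

```python
# Hypothetical mask builder illustrating the windowed-attention idea described above.
import torch

def wand_style_mask(seq_len: int, num_cond: int, window: int) -> torch.Tensor:
    """Boolean attention mask (True = may attend) for one decoding sequence.

    Positions [0, num_cond) are conditioning tokens (e.g. text/speaker prompt);
    positions [num_cond, seq_len) are generated speech tokens.
    """
    i = torch.arange(seq_len).unsqueeze(1)  # query positions
    j = torch.arange(seq_len).unsqueeze(0)  # key positions

    causal = j <= i               # standard causal constraint
    is_cond_key = j < num_cond    # keys in the conditioning prefix
    in_window = (i - j) < window  # keys within the local sliding window

    # Every query attends to all earlier conditioning tokens (persistent global
    # attention) and to generated tokens inside the sliding window (local attention).
    return causal & (is_cond_key | in_window)

mask = wand_style_mask(seq_len=12, num_cond=4, window=3)
print(mask.int())
```

Under this pattern only the conditioning tokens plus the last `window` generated tokens ever need to sit in the KV cache, which is where the reported constant memory and near-constant per-step latency would come from.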
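The fine-tuning recipe combines two ideas from the key points: a curriculum that narrows the window as training proceeds, and a distillation loss against a full-attention teacher. The sketch below is an assumption-laden illustration; the linear schedule, the `alpha` mixing weight, and the temperature `tau` are placeholders rather than values from the paper.

```python
# A minimal sketch, assuming a linear window schedule and a standard
# cross-entropy + KL distillation objective.
import torch
import torch.nn.functional as F

def window_at_step(step: int, total_steps: int, w_start: int, w_end: int) -> int:
    """Linearly narrow the attention window from w_start down to w_end."""
    frac = min(step / max(total_steps, 1), 1.0)
    return int(round(w_start + frac * (w_end - w_start)))

def distill_loss(student_logits, teacher_logits, targets, alpha=0.5, tau=2.0):
    """Blend cross-entropy on ground-truth speech tokens with a KL term that
    pulls the windowed student toward the full-attention teacher."""
    ce = F.cross_entropy(student_logits.flatten(0, 1), targets.flatten())
    kl = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean",
    ) * tau ** 2
    return (1 - alpha) * ce + alpha * kl
```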