ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models
arXiv cs.CL · April 14, 2026
Key Points
- The paper argues that standard raw-token reinforcement learning for end-to-end full-duplex speech language models can harm temporal dynamics, leading to semantic degradation, repetition, and generative collapse.
- It introduces ASPIRin, an interactivity-optimized RL framework that explicitly separates timing control (when to speak vs. when to remain silent) from content generation (what to say).
- ASPIRin uses Action Space Projection to convert the text vocabulary into a coarse-grained binary state representing active speech vs. inactive silence, then applies Group Relative Policy Optimization (GRPO) with rule-based rewards.
- Empirical results indicate ASPIRin improves interactivity across turn-taking, backchanneling, and pause handling, while reducing duplicate n-grams by over 50% relative to standard GRPO.
- The key takeaway is that isolating timing from token selection preserves semantic coherence and mitigates degenerative repetition behavior.
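The projection and reward ideas above can be sketched in a few lines. This is a minimal illustrative example, not the paper's implementation: the token IDs, the `SILENCE_TOKENS` set, and the reward rules are all assumptions made here to show the mechanism of collapsing a text-token distribution into a binary speak/silent action and scoring only the timing decision.

```python
# Hypothetical sketch of Action Space Projection: pool the probability mass
# of a full token distribution into a binary {speak, silent} action, then
# score timing with a rule-based reward. All constants are illustrative.
import math

SILENCE_TOKENS = {0}  # assumption: token id 0 marks "remain silent"

def project_to_binary(logits):
    """Project token logits onto {speak, silent} by pooling probability mass."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    p_silent = sum(exps[i] for i in SILENCE_TOKENS) / total
    return {"silent": p_silent, "speak": 1.0 - p_silent}

def rule_based_reward(action, user_is_speaking):
    """Toy timing reward: penalize barge-ins, reward yielding and turn-taking."""
    if user_is_speaking:
        return -1.0 if action == "speak" else 1.0  # interrupting vs. listening
    return 1.0 if action == "speak" else 0.0       # taking the free floor

# Example: a distribution with noticeable mass on the silence token.
probs = project_to_binary([2.0, 0.5, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1])
action = "silent" if probs["silent"] >= 0.5 else "speak"
reward = rule_based_reward(action, user_is_speaking=True)
```

In a GRPO setup, a reward like this would be computed over groups of sampled rollouts and normalized within each group; because it scores only the coarse speak/silent action, the content tokens themselves are left outside the optimization signal, which is the separation the paper argues for.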


