ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models

arXiv cs.CL / 4/14/2026


Key Points

  • The paper argues that standard raw-token reinforcement learning for end-to-end full-duplex speech language models can harm temporal dynamics, leading to semantic degradation, repetition, and generative collapse.
  • It introduces ASPIRin, an interactivity-optimized RL framework that explicitly separates timing control (when to speak vs. when to remain silent) from content generation (what to say).
  • ASPIRin uses Action Space Projection to convert the text vocabulary into a coarse-grained binary state representing active speech vs. inactive silence, then applies Group Relative Policy Optimization (GRPO) with rule-based rewards.
  • Empirical results indicate ASPIRin improves interactivity across turn-taking, backchanneling, and pause handling while substantially reducing duplicate n-grams by over 50% versus standard GRPO.
  • The key takeaway is that isolating timing from token selection preserves semantic coherence and mitigates degenerative repetition behavior.
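The projection step described above can be sketched in a few lines. This is a hedged illustration, not the paper's implementation: the `silence_ids` set (e.g. a dedicated silence or padding token) and the function name are assumptions introduced here for clarity.

```python
import math

def project_action_probs(logits, silence_ids):
    """Collapse a full-vocabulary distribution into a coarse binary
    action space: SPEAK (active speech) vs. SILENT (inactive silence).

    `silence_ids` is a hypothetical set of token ids treated as silence.
    """
    # Numerically stable softmax over the raw vocabulary logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]

    # Sum the probability mass of silence tokens; everything else is speech.
    p_silent = sum(probs[i] for i in silence_ids)
    return {"SPEAK": 1.0 - p_silent, "SILENT": p_silent}
```

Because the policy gradient is then computed over this two-way action space rather than the raw token distribution, timing decisions can be rewarded without perturbing which content tokens the model prefers.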

Abstract

End-to-end full-duplex Speech Language Models (SLMs) require precise turn-taking for natural interaction. However, optimizing temporal dynamics via standard raw-token reinforcement learning (RL) degrades semantic quality, causing severe generative collapse and repetition. We propose ASPIRin, an interactivity-optimized RL framework that explicitly decouples when to speak from what to say. Using Action Space Projection, ASPIRin maps the text vocabulary into a coarse-grained binary state (active speech vs. inactive silence). By applying Group Relative Policy Optimization (GRPO) with rule-based rewards, it balances user interruption handling against response latency. Empirical evaluations show ASPIRin optimizes interactivity across turn-taking, backchanneling, and pause handling. Crucially, isolating timing from token selection preserves semantic coherence and reduces the proportion of duplicate n-grams by over 50% compared to standard GRPO, effectively eliminating degenerative repetition.
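To make the GRPO-with-rule-based-rewards idea concrete, the sketch below combines an illustrative interactivity reward with the group-relative advantage normalization that characterizes GRPO. The specific reward terms (a bonus for yielding the turn when the user interrupts, a latency penalty) and their coefficients are assumptions for illustration; the paper's exact reward rules may differ.

```python
from statistics import mean, pstdev

def rule_based_reward(response_latency_ms, yielded_on_interrupt):
    """Illustrative rule-based interactivity reward (coefficients assumed).

    Rewards ceding the floor on a user interruption and penalizes slow
    responses, trading off interruption handling against latency.
    """
    r = 1.0 if yielded_on_interrupt else -1.0
    r -= 0.001 * response_latency_ms
    return r

def group_relative_advantages(rewards):
    """GRPO-style advantage: normalize each sampled rollout's reward
    against the mean and std of its own sampled group (no value critic)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]
```

In use, a group of rollouts for the same dialogue context would each be scored with `rule_based_reward`, and `group_relative_advantages` would convert those scores into the per-rollout advantages that weight the policy-gradient update over the binary speak/silent action space.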