X-OPD: Cross-Modal On-Policy Distillation for Capability Alignment in Speech LLMs
arXiv cs.AI / 3/27/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The paper argues that end-to-end Speech LLMs improve latency and paralinguistic modeling but still suffer a large performance gap versus text-based LLMs.
- It introduces X-OPD (Cross-Modal On-Policy Distillation), which uses on-policy rollouts to let a speech student model explore its own output distribution.
- A text-based teacher model evaluates the student trajectories and supplies token-level feedback to distill the teacher’s capabilities into the student’s multimodal representations.
- Experiments on multiple benchmarks show X-OPD significantly narrows the capability gap on complex tasks while largely preserving the student’s existing abilities.
- The work positions X-OPD as a training approach that improves over standard SFT and RL methods for aligning the capabilities of speech LLMs with those of their text LLM counterparts.
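The training loop described above can be sketched as a Monte-Carlo reverse-KL estimate on student-sampled tokens: the student rolls out from its own distribution, and a text teacher scores each sampled token. This is a toy illustration under assumed names and distributions (`STUDENT`, `TEACHER`, the context-free vocab), not the paper's actual objective, which is not given in this summary.

```python
import math
import random

# Toy sketch (hypothetical, not the paper's implementation) of on-policy
# distillation: the student samples its own rollout, then a teacher scores
# each sampled token, yielding a per-token reverse-KL signal
#   E_{y ~ student}[ log p_student(y) - log p_teacher(y) ].

VOCAB = ["a", "b", "c"]

# Hypothetical next-token distributions (context-independent for brevity).
STUDENT = {"a": 0.5, "b": 0.3, "c": 0.2}
TEACHER = {"a": 0.6, "b": 0.35, "c": 0.05}

def sample_rollout(policy, length, rng):
    """On-policy step: draw a trajectory from the student's own distribution."""
    return rng.choices(list(policy), weights=list(policy.values()), k=length)

def token_level_feedback(tokens):
    """Teacher scores each student-sampled token; positive values flag tokens
    the student over-weights relative to the teacher."""
    return [math.log(STUDENT[t]) - math.log(TEACHER[t]) for t in tokens]

rng = random.Random(0)
rollout = sample_rollout(STUDENT, 8, rng)
feedback = token_level_feedback(rollout)
loss = sum(feedback) / len(feedback)  # Monte-Carlo reverse-KL estimate
print(rollout, round(loss, 4))
```

Because the expectation is taken under the student's own samples rather than a fixed SFT corpus, the feedback concentrates on errors the student actually makes, which is the usual motivation for on-policy over off-policy distillation.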