X-OPD: Cross-Modal On-Policy Distillation for Capability Alignment in Speech LLMs

arXiv cs.AI / 3/27/2026


Key Points

  • The paper argues that end-to-end Speech LLMs improve latency and paralinguistic modeling but still suffer a large performance gap versus text-based LLMs.
  • It introduces X-OPD (Cross-Modal On-Policy Distillation), which uses on-policy rollouts to let a speech student model explore its own output distribution.
  • A text-based teacher model evaluates the student trajectories and supplies token-level feedback to distill the teacher’s capabilities into the student’s multimodal representations.
  • Experiments on multiple benchmarks show X-OPD significantly narrows the capability gap on complex tasks while largely preserving the student’s existing abilities.
  • The work positions X-OPD as a training approach that improves over standard SFT and RL methods for aligning speech LLM capabilities with text LLM counterparts.

Abstract

While the shift from cascaded dialogue systems to end-to-end (E2E) speech Large Language Models (LLMs) improves latency and paralinguistic modeling, E2E models often exhibit a significant performance degradation compared to their text-based counterparts. The standard Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) training methods fail to close this gap. To address this, we propose X-OPD, a novel Cross-Modal On-Policy Distillation framework designed to systematically align the capabilities of Speech LLMs with their text-based counterparts. X-OPD enables the Speech LLM to explore its own distribution via on-policy rollouts, where a text-based teacher model evaluates these trajectories and provides token-level feedback, effectively distilling the teacher's capabilities into the student's multi-modal representations. Extensive experiments across multiple benchmarks demonstrate that X-OPD significantly narrows the gap on complex tasks while preserving the model's inherent capabilities.
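The core loop the abstract describes — the student samples its own trajectories, and a text teacher supplies token-level feedback — can be sketched in miniature. Everything below is an illustrative assumption (toy vocabulary, random logits, and a reverse-KL objective as the token-level signal); the paper's exact objective and model interfaces are not given in this summary.

```python
import math
import random

random.seed(0)
VOCAB, SEQ_LEN = 8, 5  # toy sizes (assumptions, not from the paper)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Hypothetical stand-ins for the speech student's and text teacher's
# per-position next-token logits.
student_logits = [[random.gauss(0, 1) for _ in range(VOCAB)] for _ in range(SEQ_LEN)]
teacher_logits = [[random.gauss(0, 1) for _ in range(VOCAB)] for _ in range(SEQ_LEN)]

def on_policy_distill_loss():
    """On-policy rollout: each token is sampled from the STUDENT's own
    distribution (the student explores its own output space), then scored
    with per-token reverse KL(student || teacher), mimicking the teacher's
    token-level feedback."""
    total, rollout = 0.0, []
    for t in range(SEQ_LEN):
        p_s = softmax(student_logits[t])
        p_t = softmax(teacher_logits[t])
        tok = random.choices(range(VOCAB), weights=p_s)[0]
        rollout.append(tok)
        # Reverse KL pulls the student toward the teacher wherever the
        # student itself places probability mass.
        total += sum(ps * math.log(ps / pt) for ps, pt in zip(p_s, p_t))
    return total / SEQ_LEN, rollout

loss, rollout = on_policy_distill_loss()
```

In a real training setup the per-token KL would be backpropagated into the student's multimodal representations; here it is only computed, to show where the on-policy sampling and the teacher's token-level evaluation enter the loop.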