
Speech-Worthy Alignment for Japanese SpeechLLMs via Direct Preference Optimization

arXiv cs.CL · March 16, 2026


Key Points

  • The paper identifies a mismatch between ASR-trained encoders and text-based LLMs that makes Japanese SpeechLLMs output written-style text unsuitable for natural speech synthesis.
  • It proposes a preference-based alignment approach to produce concise, conversational outputs that are readily synthesized as natural speech.
  • The authors introduce SpokenElyza, a Japanese speech-worthiness benchmark derived from ELYZA-tasks-100 with auditory verification by native experts.
  • Experiments show substantial improvement on SpokenElyza while largely preserving performance on the original written-style evaluation.
  • They plan to release SpokenElyza to support future research in Japanese spoken dialog systems.
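The title names Direct Preference Optimization (DPO) as the alignment method. As a rough illustration of how such preference-based alignment works, here is a minimal sketch of the pairwise DPO loss for one preference pair (a speech-worthy "chosen" response vs. a written-style "rejected" one); the log-probability values are hypothetical, and the actual training setup in the paper may differ.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Pairwise DPO loss for a single preference pair.

    Inputs are total log-probabilities of the chosen (e.g. speech-worthy)
    and rejected (e.g. written-style) responses under the trainable policy
    and a frozen reference model. beta scales how strongly the policy may
    deviate from the reference.
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # Loss is -log(sigmoid(margin)): small when the policy prefers the
    # chosen response more strongly than the reference does.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical log-probabilities: the policy already favors the chosen
# response relative to the reference, so the margin is positive.
loss = dpo_loss(pi_chosen=-5.0, pi_rejected=-9.0,
                ref_chosen=-6.0, ref_rejected=-8.0)
```

Minimizing this loss pushes the policy toward the preferred (conversational) outputs while the reference term keeps it from drifting too far from the base model, which is consistent with the paper's observation that written-style performance is largely preserved.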

Abstract

SpeechLLMs typically combine ASR-trained encoders with text-based LLM backbones, leading them to inherit written-style output patterns unsuitable for text-to-speech synthesis. This mismatch is particularly pronounced in Japanese, where spoken and written registers differ substantially in politeness markers, sentence-final particles, and syntactic complexity. We propose a preference-based alignment approach to adapt Japanese SpeechLLMs for speech-worthy outputs: text that is concise, conversational, and readily synthesized as natural speech. To rigorously evaluate this task, we introduce SpokenElyza, a benchmark for Japanese speech-worthiness derived from ELYZA-tasks-100 with auditory verification by native experts. Experiments show that our approach achieves substantial improvement on SpokenElyza while largely preserving performance on the original written-style evaluation. We will release SpokenElyza to support future research on Japanese spoken dialog systems.