
Speech-Worthy Alignment for Japanese SpeechLLMs via Direct Preference Optimization

arXiv cs.CL · March 16, 2026


Key Points

  • The paper identifies a mismatch between ASR-trained encoders and text-based LLMs that makes Japanese SpeechLLMs output written-style text unsuitable for natural speech synthesis.
  • It proposes a preference-based alignment approach to produce concise, conversational outputs that are readily synthesized as natural speech.
  • The authors introduce SpokenElyza, a Japanese speech-worthiness benchmark derived from ELYZA-tasks-100 with auditory verification by native experts.
  • Experiments show substantial improvement on SpokenElyza while largely preserving performance on the original written-style evaluation.
  • They plan to release SpokenElyza to support future research in Japanese spoken dialog systems.
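The title names Direct Preference Optimization (DPO) as the alignment method. As a rough illustration of how such preference-based alignment works, here is a minimal sketch of the pairwise DPO loss for one preference pair (a speech-worthy "chosen" response vs. a written-style "rejected" one); the log-probability values are hypothetical, and the actual training setup in the paper may differ.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Pairwise DPO loss for a single preference pair.

    Inputs are total log-probabilities of the chosen (e.g. speech-worthy)
    and rejected (e.g. written-style) responses under the trainable policy
    and a frozen reference model. beta scales how strongly the policy may
    deviate from the reference.
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # Loss is -log(sigmoid(margin)): small when the policy prefers the
    # chosen response more strongly than the reference does.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical log-probabilities: the policy already favors the chosen
# response relative to the reference, so the margin is positive.
loss = dpo_loss(pi_chosen=-5.0, pi_rejected=-9.0,
                ref_chosen=-6.0, ref_rejected=-8.0)
```

Minimizing this loss pushes the policy toward the preferred (conversational) outputs while the reference term keeps it from drifting too far from the base model, which is consistent with the paper's observation that written-style performance is largely preserved.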

Abstract

SpeechLLMs typically combine ASR-trained encoders with text-based LLM backbones, leading them to inherit written-style output patterns unsuitable for text-to-speech synthesis. This mismatch is particularly pronounced in Japanese, where spoken and written registers differ substantially in politeness markers, sentence-final particles, and syntactic complexity. We propose a preference-based alignment approach to adapt Japanese SpeechLLMs for speech-worthy outputs: text that is concise, conversational, and readily synthesized as natural speech. To rigorously evaluate this task, we introduce SpokenElyza, a benchmark for Japanese speech-worthiness derived from ELYZA-tasks-100 with auditory verification by native experts. Experiments show that our approach achieves substantial improvement on SpokenElyza while largely preserving performance on the original written-style evaluation. We will release SpokenElyza to support future research on Japanese spoken dialog systems.