Speech-Worthy Alignment for Japanese SpeechLLMs via Direct Preference Optimization

arXiv cs.CL / 3/16/2026

💬 OpinionIdeas & Deep AnalysisModels & Research

共有:

Key Points

The paper identifies a mismatch between ASR-trained encoders and text-based LLMs that makes Japanese SpeechLLMs output written-style text unsuitable for natural speech synthesis.
It proposes a preference-based alignment approach to produce concise, conversational outputs that are readily synthesized as natural speech.
The authors introduce SpokenElyza, a Japanese speech-worthiness benchmark derived from ELYZA-tasks-100 with auditory verification by native experts.
Experiments show substantial improvement on SpokenElyza while largely preserving performance on the original written-style evaluation.
They plan to release SpokenElyza to support future research in Japanese spoken dialog systems.

Abstract

SpeechLLMs typically combine ASR-trained encoders with text-based LLM backbones, leading them to inherit written-style output patterns unsuitable for text-to-speech synthesis. This mismatch is particularly pronounced in Japanese, where spoken and written registers differ substantially in politeness markers, sentence-final particles, and syntactic complexity. We propose a preference-based alignment approach to adapt Japanese SpeechLLMs for speech-worthy outputs: text that is concise, conversational, and readily synthesized as natural speech. To rigorously evaluate this task, we introduce SpokenElyza, a benchmark for Japanese speech-worthiness derived from ELYZA-tasks-100 with auditory verification by native experts. Experiments show that our approach achieves substantial improvement on SpokenElyza while largely preserving performance on the original written-style evaluation. We will release SpokenElyza to support future research on Japanese spoken dialog systems.

State of MCP Security 2026: We Scanned 15,923 AI Tools. Here's What We Found.

Dev.to

Data Augmentation Using GANs

Dev.to

Building Safety Guardrails for LLM Customer Service That Actually Work in Production

Dev.to

The New AI Agent Primitive: Why Policy Needs Its Own Language (And Why YAML and Rego Fall Short)

Dev.to

The Digital Paralegal: Amplifying Legal Teams with a Copilot Co-Worker

Dev.to

Speech-Worthy Alignment for Japanese SpeechLLMs via Direct Preference Optimization

Key Points

Abstract

Related Articles

State of MCP Security 2026: We Scanned 15,923 AI Tools. Here's What We Found.

Data Augmentation Using GANs

Building Safety Guardrails for LLM Customer Service That Actually Work in Production

The New AI Agent Primitive: Why Policy Needs Its Own Language (And Why YAML and Rego Fall Short)

The Digital Paralegal: Amplifying Legal Teams with a Copilot Co-Worker

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer