JAL-Turn: Joint Acoustic-Linguistic Modeling for Real-Time and Robust Turn-Taking Detection in Full-Duplex Spoken Dialogue Systems
arXiv cs.CL / 3/30/2026
Key Points
- The paper introduces JAL-Turn, a lightweight speech-only turn-taking detection framework designed for industrial-grade full-duplex spoken dialogue systems where robustness and low latency are difficult to achieve.
- JAL-Turn uses a joint acoustic-linguistic modeling approach with a cross-attention module to integrate pre-trained acoustic representations with linguistic features for fast hold-vs-shift prediction.
- By sharing a frozen ASR encoder, the method runs turn-taking prediction fully in parallel with speech recognition, aiming to add no extra end-to-end latency or computational cost.
- The authors also propose an automated, scalable data construction pipeline that derives turn-taking labels from large real-world dialogue corpora.
- Experiments on multilingual public benchmarks and an in-house Japanese customer-service dataset show JAL-Turn improves turn-taking detection accuracy over strong baselines while preserving real-time performance.
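The cross-attention fusion described above can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the shapes, the fusion direction (linguistic tokens querying acoustic frames), the mean pooling, and the linear hold-vs-shift head are all assumptions, and the random arrays stand in for real frozen-ASR-encoder outputs and token embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    # Scaled dot-product attention: each query attends over all keys/values.
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)    # (T_q, T_k)
    return softmax(scores, axis=-1) @ values  # (T_q, d)

# Hypothetical shapes: 50 acoustic frames and 8 linguistic tokens, dim 64.
T_a, T_l, d = 50, 8, 64
acoustic = rng.standard_normal((T_a, d))    # stand-in for frozen ASR encoder output
linguistic = rng.standard_normal((T_l, d))  # stand-in for linguistic token features

# One plausible fusion direction: linguistic tokens query the acoustic frames.
fused = cross_attention(linguistic, acoustic, acoustic)  # (T_l, d)

# Pool over tokens and project to hold-vs-shift logits (hypothetical linear head).
W = rng.standard_normal((d, 2)) * 0.01
probs = softmax(fused.mean(axis=0) @ W)  # [P(hold), P(shift)], sums to 1
```

Because the attention reuses the encoder's frame representations rather than running a second speech model, a head like this can in principle run in parallel with ASR decoding, which is the latency argument the paper makes for sharing the frozen encoder.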