Human-1 by Josh Talks: A Full-Duplex Conversational Modeling Framework in Hindi using Real-World Conversations

arXiv cs.CL / 4/28/2026


Key Points

  • The paper introduces “Human-1,” an open, reproducible full-duplex spoken dialogue system for Hindi, designed to handle realistic conversation phenomena like interruptions, overlaps, and backchannels.
  • It adapts the Moshi duplex speech architecture with a custom Hindi tokenizer and trains on 26,000 hours of real spontaneous conversations from 14,695 speakers. Each speaker is recorded on a separate channel, so the model can learn turn-taking and overlap patterns directly.
  • For Hindi text generation, the authors replace the original English tokenizer and reinitialize text-vocabulary-dependent parameters while keeping the pre-trained audio components.
  • Training follows a two-stage recipe: large-scale pre-training followed by fine-tuning on 1,000 hours of conversational data.
  • Experiments using prompted dialogue continuation show, via both automatic metrics and human evaluations, that the model produces natural, meaningful full-duplex conversational behavior in Hindi. The authors present the work as a first step toward real-time duplex spoken dialogue systems for Hindi and other Indian languages.
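
To illustrate why separate speaker channels matter, here is a minimal sketch of how overlapping speech could be located once each speaker has their own track. It is not the authors' pipeline: the function name `overlap_regions` and the segment lists are hypothetical, and the per-channel speech segments are assumed to come from some voice-activity detector that the paper summary does not describe.

```python
from typing import List, Tuple

# One stretch of speech on a single channel: (start_sec, end_sec).
Segment = Tuple[float, float]


def overlap_regions(chan_a: List[Segment], chan_b: List[Segment]) -> List[Segment]:
    """Return the time spans where both channels contain speech.

    Assumes each list is sorted by start time and has no overlaps
    within the same channel.
    """
    i = j = 0
    out: List[Segment] = []
    while i < len(chan_a) and j < len(chan_b):
        start = max(chan_a[i][0], chan_b[j][0])
        end = min(chan_a[i][1], chan_b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever segment finishes first.
        if chan_a[i][1] <= chan_b[j][1]:
            i += 1
        else:
            j += 1
    return out


# Hypothetical speech segments for two speakers (illustrative values only).
speaker_a = [(0.0, 2.5), (4.0, 6.0)]
speaker_b = [(2.2, 3.8), (5.0, 5.4)]
overlaps = overlap_regions(speaker_a, speaker_b)
print(overlaps)
```

With a single mixed channel, these spans would have to be estimated by a diarization step. With one channel per speaker they follow directly from the two activity tracks.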

Abstract

Full-duplex spoken dialogue systems can model natural conversational behaviours such as interruptions, overlaps, and backchannels, yet such systems remain largely unexplored for Indian languages. We present the first open, reproducible full-duplex spoken dialogue system for Hindi by adapting Moshi, a state-of-the-art duplex speech architecture, using a custom Hindi tokeniser and training on 26,000 hours of real spontaneous conversations collected from 14,695 speakers with separate speaker channels, enabling direct learning of turn-taking and overlap patterns from natural interactions. To support Hindi text generation, we replace the original English tokeniser and reinitialise text-vocabulary-dependent parameters while retaining the pre-trained audio components. We propose a two-stage training recipe -- large-scale pre-training followed by fine-tuning on 1,000 hours of conversational data. Evaluation through the prompted dialogue continuation paradigm with both automatic metrics and human judgments demonstrates that the resulting model generates natural and meaningful full-duplex conversational behaviour in Hindi. This work serves as a first step toward real-time duplex spoken dialogue systems for Hindi and other Indian languages.
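
The tokeniser swap described in the abstract can be pictured with a toy sketch. This is an illustration under stated assumptions, not the authors' code or Moshi's actual parameter layout: the parameter names (`text_embedding`, `text_output_head`, `audio_embedding`), the sizes, and the Gaussian initialisation are all hypothetical. It shows only the general idea of reinitialising matrices whose shape depends on the text vocabulary while reusing everything else.

```python
import random
from typing import Dict, List

Matrix = List[List[float]]  # rows x cols, a stand-in for a weight tensor


def init_matrix(rows: int, cols: int, rng: random.Random, std: float = 0.02) -> Matrix:
    """Create a freshly initialised matrix (small Gaussian noise, an assumption)."""
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


def swap_text_vocab(
    params: Dict[str, Matrix],
    new_vocab_size: int,
    text_keys: List[str],
    seed: int = 0,
) -> Dict[str, Matrix]:
    """Reinitialise text-vocabulary-dependent matrices; reuse all others."""
    rng = random.Random(seed)
    adapted: Dict[str, Matrix] = {}
    for name, weight in params.items():
        if name in text_keys:
            hidden = len(weight[0])  # keep the hidden width, change the vocab rows
            adapted[name] = init_matrix(new_vocab_size, hidden, rng)
        else:
            adapted[name] = weight  # pre-trained weights carried over unchanged
    return adapted


# Toy "checkpoint": old text vocab of 4 tokens, hidden width 3 (hypothetical).
toy_rng = random.Random(1)
toy = {
    "text_embedding": init_matrix(4, 3, toy_rng),
    "text_output_head": init_matrix(4, 3, toy_rng),
    "audio_embedding": init_matrix(5, 3, toy_rng),
}
adapted = swap_text_vocab(
    toy, new_vocab_size=6, text_keys=["text_embedding", "text_output_head"]
)
print({name: (len(w), len(w[0])) for name, w in adapted.items()})
```

In this toy setup the two text matrices grow to the new vocabulary size and start from scratch, while the audio matrix is the same object as before, mirroring the abstract's statement that the pre-trained audio components are retained.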