MoshiRAG: Asynchronous Knowledge Retrieval for Full-Duplex Speech Language Models
arXiv cs.CL / 4/15/2026
Key Points
- The paper introduces MoshiRAG, a modular retrieval-augmented approach aimed at improving factuality in full-duplex speech-to-speech language models without relying on costly model scaling.
- It uses an asynchronous framework that triggers knowledge retrieval for knowledge-demanding queries and exploits the natural timing gap during conversation to complete retrieval without disrupting turn-taking.
- MoshiRAG combines a compact full-duplex interface with selective retrieval from stronger external knowledge sources to maintain real-time interactivity (pauses, interruptions, backchannels).
- The authors report factuality comparable to leading publicly released non-duplex speech language models while preserving full-duplex responsiveness.
- The authors describe the design as plug-and-play: different retrieval methods can be swapped in without retraining. They also report strong results on out-of-domain mathematical reasoning tasks.
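
To make the asynchronous idea in the points above concrete, here is a minimal Python sketch, not the authors' implementation. It assumes a retriever is any async function from a query string to a reference string, so one retriever can be swapped for another without touching the response loop. Every name in it (`Retriever`, `toy_retriever`, `needs_retrieval`, `respond`) is hypothetical, the keyword trigger is a toy stand-in for whatever the model actually learns, and the sleeps only simulate audio frames and retrieval latency.

```python
import asyncio
from typing import Awaitable, Callable, List

# Hypothetical interface: a retriever is any async function query -> reference text.
Retriever = Callable[[str], Awaitable[str]]


async def toy_retriever(query: str) -> str:
    """Stand-in for an external knowledge source (hypothetical)."""
    await asyncio.sleep(0.05)  # simulated retrieval latency
    return f"[reference for: {query}]"


def needs_retrieval(query: str) -> bool:
    """Toy trigger based on keywords; an assumption, not the paper's method."""
    lowered = query.lower()
    return any(cue in lowered for cue in ("who", "when", "where", "how many"))


async def respond(
    query: str,
    retriever: Retriever,
    n_frames: int = 8,
    frame_s: float = 0.02,
) -> List[str]:
    """Emit output frames in real time while retrieval runs in the background."""
    # Start retrieval without waiting for it, so speaking is never blocked.
    task = asyncio.create_task(retriever(query)) if needs_retrieval(query) else None
    frames: List[str] = []
    for i in range(n_frames):
        await asyncio.sleep(frame_s)  # one frame of simulated audio output
        if task is not None and task.done():
            # The reference arrived: later frames can be grounded in it.
            frames.append(f"frame {i}: grounded in {task.result()}")
        else:
            # Either no retrieval was triggered or it has not finished yet.
            frames.append(f"frame {i}: filler")
    if task is not None and not task.done():
        task.cancel()  # the turn ended before retrieval finished
    return frames


if __name__ == "__main__":
    for line in asyncio.run(respond("When was the Eiffel Tower built?", toy_retriever)):
        print(line)
```

Passing a different async function as `retriever` changes the knowledge source without changing `respond`, which is the sense of "plug-and-play" this sketch illustrates.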




