Introducing MELI: the Mandarin-English Language Interview Corpus
arXiv cs.CL / 3/31/2026
Key Points
- The paper introduces MELI, an open-source Mandarin-English Language Interview corpus containing 29.8 hours of speech from 51 bilingual speakers.
- MELI pairs matched Mandarin and English interview sessions that cover two speaking styles: read sentences and spontaneous interviews focused on language varieties, standardness, and learning experiences.
- The dataset provides fully transcribed audio plus word- and phone-level forced alignments, anonymized recordings, and supporting metadata including token/type statistics and code-switching patterns.
- Audio was recorded at 44.1 kHz (16-bit, stereo), and the corpus is designed to enable cross-/within-language and cross-/within-speaker acoustic comparisons tied to speakers’ language attitudes.
- MELI is set to be released with transcriptions, alignments, metadata, labeled map scans, and documentation under a CC BY-NC 4.0 license.
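
The recording format listed above (44.1 kHz, 16-bit, stereo) can be checked on any downloaded WAV file with Python's standard `wave` module. The following is a minimal sketch, not part of the MELI release: the function names are hypothetical, and it assumes the audio is distributed as uncompressed PCM WAV, which the summary does not state.

```python
import io
import wave

# Format stated in the summary: 44.1 kHz, 16-bit, stereo.
EXPECTED_RATE_HZ = 44_100
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16 bits per sample
EXPECTED_CHANNELS = 2            # stereo


def matches_stated_format(wav_source) -> bool:
    """Return True if a WAV file (path or binary file object) has the stated format."""
    with wave.open(wav_source, "rb") as wav:
        return (
            wav.getframerate() == EXPECTED_RATE_HZ
            and wav.getsampwidth() == EXPECTED_SAMPLE_WIDTH_BYTES
            and wav.getnchannels() == EXPECTED_CHANNELS
        )


def uncompressed_size_bytes(hours: float) -> int:
    """Raw PCM size for the given duration, ignoring WAV header overhead."""
    bytes_per_second = (
        EXPECTED_RATE_HZ * EXPECTED_SAMPLE_WIDTH_BYTES * EXPECTED_CHANNELS
    )
    return round(hours * 3600 * bytes_per_second)


def make_silent_wav(rate_hz: int, channels: int, n_frames: int = 100) -> io.BytesIO:
    """Build a small in-memory 16-bit WAV of silence, used here only for the demo."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(EXPECTED_SAMPLE_WIDTH_BYTES)
        wav.setframerate(rate_hz)
        wav.writeframes(b"\x00" * (n_frames * channels * EXPECTED_SAMPLE_WIDTH_BYTES))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(matches_stated_format(make_silent_wav(44_100, 2)))  # stereo, 44.1 kHz
    print(matches_stated_format(make_silent_wav(16_000, 1)))  # mono, 16 kHz
    print(uncompressed_size_bytes(29.8))
```

Under the same assumption, 29.8 hours of audio in this format works out to roughly 18.9 GB of raw PCM. This is plain arithmetic from the stated sample rate, bit depth, and channel count, not a size reported for the corpus.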



