Bidirectional Chinese and English Passive Sentences Dataset for Machine Translation
arXiv cs.CL / 3/17/2026
Key Points
- The paper proposes a bidirectional, multi-domain dataset of English-Chinese passive sentences, intended to improve how machine translation (MT) systems are evaluated on voice-related linguistic phenomena.
- The dataset includes 73,965 parallel sentence pairs (2,358,731 English words, 3,498,229 Chinese characters) from five Chinese-English corpora, with automatic structure-label annotations and a manually verified test set.
- It benchmarks two open-source MT systems and four commercial models, revealing that, in both translation directions, model output is influenced by the voice used in the source sentence and tends to preserve a source passive.
- The study finds that commercial NMT models excel on standard metrics, while LLMs provide more diverse alternative translations. The authors note that the dataset and annotation scripts will be shared upon request.
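
To put the reported dataset size in perspective, the totals in the key points can be turned into per-pair averages. The short Python sketch below does only that arithmetic. The constant and function names are illustrative, and the averages are derived here from the reported totals rather than taken from the paper.

```python
# Corpus totals as reported in the key points above.
SENTENCE_PAIRS = 73_965
ENGLISH_WORDS = 2_358_731
CHINESE_CHARS = 3_498_229


def per_pair_averages(pairs: int, en_words: int, zh_chars: int) -> tuple[float, float, float]:
    """Return (English words per pair, Chinese characters per pair,
    Chinese characters per English word)."""
    return en_words / pairs, zh_chars / pairs, zh_chars / en_words


avg_en, avg_zh, ratio = per_pair_averages(SENTENCE_PAIRS, ENGLISH_WORDS, CHINESE_CHARS)

print(f"English words per pair:       {avg_en:.1f}")
print(f"Chinese characters per pair:  {avg_zh:.1f}")
print(f"Chinese chars per Eng. word:  {ratio:.2f}")
```

On these totals, the average pair has roughly 32 English words and 47 Chinese characters, or about 1.5 Chinese characters per English word. These are corpus-wide means and say nothing about how length varies across the five source corpora.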




