Bidirectional Chinese and English Passive Sentences Dataset for Machine Translation
arXiv cs.CL / 3/17/2026
Key Points
- The paper proposes a bidirectional, multi-domain dataset of English-Chinese passive sentences to enhance MT evaluation of linguistic phenomena related to voice.
- The dataset includes 73,965 parallel sentence pairs (2,358,731 English words, 3,498,229 Chinese characters) from five Chinese-English corpora, with automatic structure-label annotations and a manually verified test set.
- It benchmarks two open-source MT systems and four commercial models. In both translation directions, the models tend to preserve the passive voice of the source sentence, and their output is influenced by how voice is used in the source.
- The study finds that commercial NMT models excel on standard metrics, while LLMs provide more diverse alternative translations. The authors note that the datasets and annotation scripts will be shared upon request.
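
As a rough sanity check on the dataset size quoted above, the average sentence length per pair can be derived from the three reported totals. The sketch below is illustrative arithmetic only: the constants are the figures from the summary, and the function name `average_per_pair` is a hypothetical helper, not part of the authors' annotation scripts.

```python
# Totals as quoted in the summary; nothing here comes from the dataset itself.
PAIRS = 73_965
ENGLISH_WORDS = 2_358_731
CHINESE_CHARS = 3_498_229


def average_per_pair(total_units: int, pairs: int) -> float:
    """Return the mean number of units (words or characters) per sentence pair."""
    if pairs <= 0:
        raise ValueError("pairs must be positive")
    return total_units / pairs


# Hypothetical helper applied to the reported totals.
avg_en = average_per_pair(ENGLISH_WORDS, PAIRS)
avg_zh = average_per_pair(CHINESE_CHARS, PAIRS)

print(f"English words per pair: {avg_en:.2f}")
print(f"Chinese characters per pair: {avg_zh:.2f}")
```

Under these figures, a pair averages roughly 31.9 English words and 47.3 Chinese characters. These are corpus-wide means and say nothing about how length varies across the five source corpora.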