RespondeoQA: A Benchmark for Bilingual Latin-English Question Answering

arXiv cs.CL / April 23, 2026


Key Points

  • The paper introduces RespondeoQA, a bilingual Latin-English question answering and translation benchmark with about 7,800 question-answer pairs.
  • Questions are sourced from Latin pedagogical materials such as exams, quizbowl-style trivia, and textbooks spanning the 1800s to the present, and are curated via automated extraction, cleaning, and manual review.
  • The benchmark includes multiple task types, including knowledge/skill questions, multihop reasoning, constrained translation, and mixed-language pairs.
  • In evaluations of three large language models (LLaMa 3, Qwen QwQ, and OpenAI o3-mini), all models generally perform worse on skill-oriented questions, with reasoning-focused models doing better on scansion and literary-device tasks.
  • The dataset is released publicly, and the authors note the construction pipeline can be adapted to benchmark other languages; a minimal loading sketch follows below.
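For context, the snippet below sketches how one might load and inspect the released data. This is a minimal sketch, not the repository's documented interface: the file name `respondeoqa.jsonl` and the field names `question`, `answer`, and `task_type` are assumptions, since the actual schema isn't described here.

```python
import json
from collections import Counter

def load_respondeo_qa(path):
    """Load question-answer pairs from a JSONL file (assumed layout:
    one JSON object per line; field names used below are hypothetical)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

pairs = load_respondeo_qa("respondeoqa.jsonl")  # hypothetical file name
print(f"{len(pairs)} question-answer pairs")

# Tally pairs per task type (knowledge/skill, multihop reasoning,
# constrained translation, mixed-language), per the paper's categories.
print(Counter(p["task_type"] for p in pairs).most_common())
```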

Abstract

We introduce a benchmark dataset for question answering and translation in bilingual Latin and English settings, containing about 7,800 question-answer pairs. The questions are drawn from Latin pedagogical sources, including exams, quizbowl-style trivia, and textbooks ranging from the 1800s to the present. After automated extraction, cleaning, and manual review, the dataset covers a diverse range of question types: knowledge- and skill-based, multihop reasoning, constrained translation, and mixed-language pairs. To our knowledge, this is the first QA benchmark centered on Latin. As a case study, we evaluate three large language models -- LLaMa 3, Qwen QwQ, and OpenAI's o3-mini -- finding that all perform worse on skill-oriented questions. Although the reasoning models perform better on scansion and literary-device tasks, they offer limited improvement overall. QwQ performs slightly better on questions asked in Latin, whereas LLaMa 3 and o3-mini are more task-dependent. This dataset provides a new resource for assessing model capabilities in a specialized linguistic and cultural domain, and the creation process can be easily adapted for other languages. The dataset is available at: https://github.com/slanglab/RespondeoQA
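As an illustration of the kind of per-task breakdown the abstract reports, here is a hedged evaluation loop. It is a sketch under assumptions: `ask_model` stands for any question-to-answer callable (e.g., an API wrapper around LLaMa 3, Qwen QwQ, or o3-mini), the field names are hypothetical, and normalized exact match is one plausible metric, not necessarily the paper's.

```python
from collections import defaultdict

def exact_match(pred: str, gold: str) -> bool:
    # Normalized exact match; the paper's actual scoring may differ.
    return pred.strip().lower() == gold.strip().lower()

def evaluate(pairs, ask_model):
    """Score a model separately per (task type, question language).

    `ask_model` maps a question string to an answer string; the field
    names ("task_type", "language", "question", "answer") are assumed,
    not taken from the released schema.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for p in pairs:
        key = (p["task_type"], p["language"])
        totals[key] += 1
        if exact_match(ask_model(p["question"]), p["answer"]):
            hits[key] += 1
    return {k: hits[k] / totals[k] for k in sorted(totals)}
```

Splitting scores along these axes is what surfaces findings like the abstract's: weaker performance on skill-oriented questions, and model-specific differences between Latin- and English-language prompts.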