AUDITA: A New Dataset to Audit Humans vs. AI Skill at Audio QA
arXiv cs.CL / 4/24/2026
Key Points
- The article introduces AUDITA, a new large-scale audio question answering (QA) benchmark designed to audit genuine human-vs-AI skill rather than reward easy shortcut cues.
- The dataset uses human-authored, real-world audio trivia questions with challenging distractors and long-range temporal dependencies, including probing questions intended to be unanswerable from isolated text or audio cues alone.
- Results show humans achieving an average accuracy of 32.13%, while state-of-the-art audio QA models average under 8.86%, indicating that current models struggle with robust audio reasoning.
- The work applies Item Response Theory (IRT) to estimate latent proficiency and question difficulty, surfacing systematic weaknesses in both the models and the dataset design.
- Overall, AUDITA is positioned as a more rigorous evaluation framework for audio reasoning that reduces risks of metadata/caption-based bypasses and dataset-specific biases.
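The IRT analysis mentioned above can be illustrated with the simplest IRT variant, the one-parameter (Rasch) model, in which the probability of a correct answer depends only on the gap between a respondent's latent proficiency and a question's latent difficulty. The sketch below is a minimal, hypothetical illustration of that idea (the paper's actual IRT formulation and fitting procedure may differ); all names and the toy response matrix are invented for demonstration.

```python
import math

def rasch_p(theta, b):
    # 1PL (Rasch) model: P(correct) = sigmoid(theta - b)
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def fit_rasch(responses, steps=2000, lr=0.05):
    """Jointly estimate abilities (one per row) and difficulties (one per
    column) from a 0/1 response matrix by gradient ascent on the Bernoulli
    log-likelihood. Toy estimator, not the paper's method."""
    n, m = len(responses), len(responses[0])
    theta = [0.0] * n   # latent proficiency per respondent (human or model)
    b = [0.0] * m       # latent difficulty per question
    for _ in range(steps):
        for i in range(n):
            for j in range(m):
                err = responses[i][j] - rasch_p(theta[i], b[j])
                theta[i] += lr * err   # d(log-lik)/d(theta_i) = y - p
                b[j] -= lr * err       # d(log-lik)/d(b_j) = -(y - p)
    return theta, b

# Toy data: row 0 is the strongest respondent, column 2 the hardest item.
R = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
theta, b = fit_rasch(R)
```

Fitting recovers the expected ordering on the toy data (`theta[0] > theta[2]`, `b[2] > b[0]`), which is how such an analysis can surface systematic weaknesses: items whose estimated difficulty is high for models but not for humans point to capability gaps rather than generically hard questions.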