AUDITA: A New Dataset to Audit Humans vs. AI Skill at Audio QA

arXiv cs.CL / 4/24/2026


Key Points

  • The paper introduces AUDITA, a large-scale audio question answering (QA) benchmark designed to audit genuine human-vs-AI skill at audio QA rather than reward easy shortcut cues.
  • The dataset uses human-authored, real-world audio trivia questions with challenging distractors and long-range temporal dependencies, including probing questions intended to be unanswerable from isolated text or audio cues alone.
  • Results show humans achieve 32.13% average accuracy, while state-of-the-art audio QA models average below 8.86%, indicating that current models struggle with robust audio reasoning.
  • The work applies Item Response Theory (IRT) to estimate latent proficiency and question difficulty, surfacing systematic weaknesses in both the models and the dataset design.
  • Overall, AUDITA is positioned as a more rigorous evaluation framework for audio reasoning that reduces risks of metadata/caption-based bypasses and dataset-specific biases.

Abstract

Existing audio question answering benchmarks largely emphasize sound event classification or caption-grounded queries, often enabling models to succeed through shortcut strategies, short-duration cues, lexical priors, dataset-specific biases, or even bypassing audio via metadata and captions rather than genuine reasoning. Thus, we present AUDITA (Audio Understanding from Diverse Internet Trivia Authors), a large-scale, real-world benchmark to rigorously evaluate audio reasoning beyond surface-level acoustic recognition. AUDITA comprises carefully curated, human-authored trivia questions grounded in real-world audio, designed to stress robust auditory reasoning through challenging distractors and long-range temporal dependencies, using probing queries that cannot be answered from isolated text or sound cues alone. An average human accuracy of 32.13% reflects both the difficulty of the task and meaningful comprehension of the audio. In stark contrast, state-of-the-art audio question answering models perform poorly, with average accuracy below 8.86%. Beyond raw accuracy, we apply Item Response Theory (IRT) to estimate latent proficiency and question difficulty, and to expose systematic deficiencies of the models and the data.
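
The abstract does not specify which IRT variant the authors fit or how they fit it, so the sketch below is only a minimal illustration of the general technique, not the paper's method: a two-parameter logistic (2PL) IRT model, where theta is a respondent's latent proficiency, b an item's difficulty, and a its discrimination, estimated here by joint maximum likelihood with plain gradient ascent. All function and variable names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_2pl(responses, n_iters=3000, lr=0.05):
    """Fit a 2PL IRT model to a binary response matrix.

    responses: (n_respondents, n_items) array, 1 = correct, 0 = incorrect.
    Model: P(correct_ij) = sigmoid(a_j * (theta_i - b_j)), with
      theta_i = latent proficiency, b_j = difficulty, a_j = discrimination.
    Joint maximum likelihood via gradient ascent (an illustrative choice;
    standard IRT toolkits typically use marginal ML or EM instead).
    """
    n_resp, n_items = responses.shape
    theta = np.zeros(n_resp)    # respondent proficiencies
    b = np.zeros(n_items)       # item difficulties
    a = np.ones(n_items)        # item discriminations
    for _ in range(n_iters):
        p = sigmoid(a * (theta[:, None] - b))  # predicted P(correct)
        err = responses - p                    # d(log-lik)/d(logit)
        theta += lr * (err * a).mean(axis=1)   # dz/dtheta = a
        b     -= lr * (err * a).mean(axis=0)   # dz/db = -a
        a     += lr * (err * (theta[:, None] - b)).mean(axis=0)
        theta -= theta.mean()                  # pin the scale's origin
    return theta, b, a

# Toy usage: simulate respondents and items, then recover item difficulty.
rng = np.random.default_rng(0)
true_theta = rng.normal(size=200)
true_b = rng.normal(size=50)
probs = sigmoid(true_theta[:, None] - true_b)
data = (rng.random((200, 50)) < probs).astype(float)
theta_hat, b_hat, a_hat = fit_2pl(data)
print(np.corrcoef(true_b, b_hat)[0, 1])  # should be close to 1
```

Under this parameterization, items with large estimated b are the hardest, and a wide gap between human and model theta estimates would mirror the accuracy gap the paper reports; the specific diagnostics AUDITA derives from IRT are detailed in the paper itself.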