AUDITA: A New Dataset to Audit Humans vs. AI Skill at Audio QA

arXiv cs.CL / 4/24/2026


Key Points

  • The paper introduces AUDITA, a large-scale audio question answering (QA) benchmark designed to audit genuine human-vs-AI skill at audio QA rather than reward easy shortcut cues.
  • The dataset uses human-authored, real-world audio trivia questions with challenging distractors and long-range temporal dependencies, including probing questions intended to be unanswerable from isolated text or audio cues alone.
  • Results show humans achieve 32.13% average accuracy, while state-of-the-art audio QA models average below 8.86%, indicating that current models struggle with robust audio reasoning.
  • The work applies Item Response Theory (IRT) to estimate latent proficiency and question difficulty, surfacing systematic weaknesses in both the models and the dataset design.
  • Overall, AUDITA is positioned as a more rigorous evaluation framework for audio reasoning that reduces risks of metadata/caption-based bypasses and dataset-specific biases.

Abstract

Existing audio question answering benchmarks largely emphasize sound event classification or caption-grounded queries, often enabling models to succeed through shortcut strategies, short-duration cues, lexical priors, dataset-specific biases, or even bypassing audio via metadata and captions rather than genuine reasoning. Thus, we present AUDITA (Audio Understanding from Diverse Internet Trivia Authors), a large-scale, real-world benchmark to rigorously evaluate audio reasoning beyond surface-level acoustic recognition. AUDITA comprises carefully curated, human-authored trivia questions grounded in real-world audio, designed to stress robust auditory reasoning through challenging distractors and long-range temporal dependencies, using probing queries that cannot be answered from isolated text or sound cues alone. An average human accuracy of 32.13% reflects both the difficulty of the task and meaningful comprehension of the audio. In stark contrast, state-of-the-art audio question answering models perform poorly, with average accuracy below 8.86%. Beyond raw accuracy, we apply Item Response Theory (IRT) to estimate latent proficiency and question difficulty, and to expose systematic deficiencies of the models and the data.
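
The abstract does not specify which IRT variant the authors fit or how they fit it, so the sketch below is only a minimal illustration of the general technique, not the paper's method: a two-parameter logistic (2PL) IRT model, where theta is a respondent's latent proficiency, b an item's difficulty, and a its discrimination, estimated here by joint maximum likelihood with plain gradient ascent. All function and variable names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_2pl(responses, n_iters=3000, lr=0.05):
    """Fit a 2PL IRT model to a binary response matrix.

    responses: (n_respondents, n_items) array, 1 = correct, 0 = incorrect.
    Model: P(correct_ij) = sigmoid(a_j * (theta_i - b_j)), with
      theta_i = latent proficiency, b_j = difficulty, a_j = discrimination.
    Joint maximum likelihood via gradient ascent (an illustrative choice;
    standard IRT toolkits typically use marginal ML or EM instead).
    """
    n_resp, n_items = responses.shape
    theta = np.zeros(n_resp)    # respondent proficiencies
    b = np.zeros(n_items)       # item difficulties
    a = np.ones(n_items)        # item discriminations
    for _ in range(n_iters):
        p = sigmoid(a * (theta[:, None] - b))  # predicted P(correct)
        err = responses - p                    # d(log-lik)/d(logit)
        theta += lr * (err * a).mean(axis=1)   # dz/dtheta = a
        b     -= lr * (err * a).mean(axis=0)   # dz/db = -a
        a     += lr * (err * (theta[:, None] - b)).mean(axis=0)
        theta -= theta.mean()                  # pin the scale's origin
    return theta, b, a

# Toy usage: simulate respondents and items, then recover item difficulty.
rng = np.random.default_rng(0)
true_theta = rng.normal(size=200)
true_b = rng.normal(size=50)
probs = sigmoid(true_theta[:, None] - true_b)
data = (rng.random((200, 50)) < probs).astype(float)
theta_hat, b_hat, a_hat = fit_2pl(data)
print(np.corrcoef(true_b, b_hat)[0, 1])  # should be close to 1
```

Under this parameterization, items with large estimated b are the hardest, and a wide gap between human and model theta estimates would mirror the accuracy gap the paper reports; the specific diagnostics AUDITA derives from IRT are detailed in the paper itself.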