AI Navigate

Attention-guided Evidence Grounding for Spoken Question Answering

arXiv cs.CL · March 18, 2026

📰 News · Models & Research

Key Points

  • Attention-guided Evidence Grounding (AEG) is introduced as an end-to-end framework for Spoken Question Answering that leverages the internal cross-modal attention of Speech Large Language Models to locate and ground key evidence in the model's latent space.
  • Learning to Focus on Evidence (LFE) is proposed as a supervised fine-tuning paradigm that calibrates the model's attention to distinguish query-relevant segments from irrelevant context.
  • Experiments on SQuAD, HotpotQA, and MuSiQue demonstrate reduced hallucinations and strong efficiency, outperforming large-scale cascaded baselines (Whisper-Large-v3 + Reranker).
  • The approach achieves approximately a 62% reduction in inference latency compared with the cascaded baseline.

Abstract

Spoken Question Answering (Spoken QA) presents a challenging cross-modal problem: effectively aligning acoustic queries with textual knowledge while avoiding the latency and error propagation inherent in cascaded ASR-based systems. In this paper, we introduce Attention-guided Evidence Grounding (AEG), a novel end-to-end framework that leverages the internal cross-modal attention of Speech Large Language Models (SpeechLLMs) to explicitly locate and ground key evidence in the model's latent space. To address the diffuse attention distribution in pre-trained models, we propose Learning to Focus on Evidence (LFE), a supervised fine-tuning paradigm that calibrates the model's attention mechanism to distinguish query-relevant segments from irrelevant context. Experiments on SQuAD, HotpotQA, and MuSiQue demonstrate that AEG reduces hallucinations and achieves strong efficiency gains, outperforming large-scale cascaded baselines (Whisper-Large-v3 + Reranker) while reducing inference latency by approximately 62%.
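The two mechanisms the abstract describes — pooling a SpeechLLM's cross-modal attention to locate evidence segments, and a fine-tuning loss that sharpens that attention onto labeled evidence — can be sketched roughly as below. This is a minimal illustration, not the paper's implementation: the pooling scheme, the function names, and the cross-entropy form of the LFE objective are all assumptions, since the summary does not specify the exact loss.

```python
import torch


def ground_evidence(cross_attn: torch.Tensor, top_k: int = 1) -> torch.Tensor:
    """Pick the top-k context segments by pooled cross-modal attention mass.

    cross_attn: (heads, query_tokens, segments) attention weights, where each
    row is a distribution over context segments. Pooling over heads and query
    tokens gives one relevance score per segment (an illustrative choice).
    """
    scores = cross_attn.mean(dim=(0, 1))  # (segments,)
    return torch.topk(scores, k=top_k).indices


def lfe_attention_loss(cross_attn: torch.Tensor,
                       evidence_mask: torch.Tensor) -> torch.Tensor:
    """Auxiliary 'focus on evidence' loss (hypothetical form).

    Cross-entropy between the pooled attention distribution and the
    normalized binary evidence mask: it shrinks as attention mass
    concentrates on the segments labeled as evidence.
    """
    pooled = cross_attn.mean(dim=(0, 1))       # (segments,)
    target = evidence_mask / evidence_mask.sum()
    return -(target * pooled.clamp_min(1e-9).log()).sum()
```

In this toy form, attention that is more concentrated on the evidence segment yields both the correct grounding index and a lower calibration loss, which is the qualitative behavior LFE is meant to induce.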