Attention-guided Evidence Grounding for Spoken Question Answering
arXiv cs.CL / March 18, 2026
Key Points
- Attention-guided Evidence Grounding (AEG) is introduced as an end-to-end framework for Spoken Question Answering that leverages the internal cross-modal attention of Speech Large Language Models to locate and ground key evidence in the model's latent space.
- Learning to Focus on Evidence (LFE) is proposed as a supervised fine-tuning paradigm that calibrates the model's attention to distinguish query-relevant segments from irrelevant context.
- Experiments on SQuAD, HotpotQA, and MuSiQue demonstrate reduced hallucinations and strong efficiency, outperforming large-scale cascaded baselines (Whisper-Large-v3 + Reranker).
- The approach achieves approximately a 62% reduction in inference latency compared with the cascaded baseline.
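The core idea in the first bullet, using a model's own query-to-speech cross-attention to pick out evidence segments, can be sketched in a few lines. The paper's exact grounding mechanism is not detailed here; the sketch below is a generic illustration (all names and the top-k aggregation scheme are assumptions, not the authors' implementation).

```python
import numpy as np

def ground_evidence(cross_attention, segment_ids, top_k=2):
    """Toy attention-guided evidence selection (illustrative only).

    cross_attention: (num_query_tokens, num_speech_frames) attention weights
    segment_ids: (num_speech_frames,) candidate-segment index of each frame
    Returns the indices of the top_k most attended segments and all scores.
    """
    # Average the attention each speech frame receives across query tokens.
    frame_scores = cross_attention.mean(axis=0)
    # Sum frame-level scores within each candidate segment.
    num_segments = int(segment_ids.max()) + 1
    seg_scores = np.zeros(num_segments)
    np.add.at(seg_scores, segment_ids, frame_scores)
    # Keep the most attended segments as the grounded evidence.
    top = np.argsort(seg_scores)[::-1][:top_k]
    return sorted(top.tolist()), seg_scores

# Example: two query tokens, six speech frames in three segments;
# the middle segment carries most of the attention mass.
attn = np.array([[0.1, 0.1, 0.6, 0.6, 0.1, 0.1],
                 [0.1, 0.1, 0.5, 0.7, 0.1, 0.1]])
segs = np.array([0, 0, 1, 1, 2, 2])
top, _ = ground_evidence(attn, segs, top_k=1)  # → [1]
```

A supervision signal in the spirit of LFE would then push `seg_scores` mass toward annotated evidence segments during fine-tuning, rather than applying this selection only at inference.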