Beyond Transcription: Unified Audio Schema for Perception-Aware AudioLLMs
arXiv cs.CL · April 15, 2026
Key Points
- The paper argues that many AudioLLMs underperform on fine-grained acoustic perception because ASR-centric training teaches models to suppress paralinguistic and non-linguistic acoustic cues as "noise."
- It introduces the Unified Audio Schema (UAS), a structured supervision framework that decomposes each training target into Transcription, Paralinguistics, and Non-linguistic Events within a single JSON format (a hypothetical record is sketched after this list).
- The approach is designed to improve acoustic coverage while maintaining the audio-text alignment needed for strong reasoning in AudioLLMs.
- Experiments on discrete and continuous AudioLLM architectures show consistent gains, including a 10.9% improvement in fine-grained perception on MMSU compared with same-size state-of-the-art baselines.
- The authors report that reasoning capabilities remain robust and provide public code/models via the linked GitHub repository.
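The paper's exact schema is not reproduced in this summary, so the following is a minimal sketch of what a unified record covering the three supervision streams might look like. All field names here (`transcription`, `paralinguistics`, `non_linguistic_events`, and the attributes inside them) are illustrative assumptions, not the published format.

```python
import json

# Hypothetical UAS-style record (field names are assumptions, not the
# paper's actual schema). One audio clip's supervision is decomposed
# into the three streams the paper names: transcription,
# paralinguistics, and non-linguistic events.
uas_record = {
    "transcription": "sure, I can meet at three",
    "paralinguistics": {
        # Assumed attribute set; the paper may use different labels.
        "emotion": "neutral",
        "speaking_rate": "fast",
    },
    "non_linguistic_events": [
        {"event": "door_slam", "start_s": 1.2, "end_s": 1.4},
        {"event": "background_music", "start_s": 0.0, "end_s": 5.0},
    ],
}

# Serializing everything into a single JSON target string lets one
# training example supervise all three streams jointly, rather than
# transcription alone.
target_text = json.dumps(uas_record, ensure_ascii=False)
print(target_text)
```

Under this framing, the ASR-only baseline corresponds to keeping just the `transcription` field, which is consistent with the paper's claim that ASR-centric training discards the other two streams as noise.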