Qwen3.5-Omni Technical Report

arXiv cs.CL / 4/20/2026

Key Points

  • Qwen3.5-Omni is introduced as a major upgrade in the Qwen-Omni family, scaling to hundreds of billions of parameters and supporting a 256k context length for stronger omni-modal capabilities.
  • The model is trained on a large multimodal dataset of heterogeneous text-vision pairs plus over 100 million hours of audio-visual content, and Qwen3.5-Omni-plus reaches state-of-the-art results across 215 audio/audio-visual subtasks.
  • Architecturally, it uses a Hybrid Attention Mixture-of-Experts (MoE) design for both Thinker and Talker to enable efficient long-sequence inference and extended interaction capabilities (e.g., 10+ hours of audio understanding and up to 400 seconds of 720P video at 1 FPS).
  • To improve streaming speech synthesis stability and naturalness, the report proposes ARIA, which dynamically aligns text and speech units to enhance prosody with minimal added latency; a conceptual sketch of this kind of text-speech pacing follows this list.
  • The technical report highlights expanded multilingual speech generation across 10 languages with emotional nuance, strong audio-visual grounding (structured, temporally synchronized captions and scene segmentation), and a newly observed “Audio-Visual Vibe Coding” capability: writing code directly from audio-visual instructions.
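
The summary above does not describe ARIA's internals, so the sketch below is only a hypothetical illustration of what "dynamically aligning text and speech units" can look like in a streaming pipeline: speech-unit emission is paced against incoming text tokens under an assumed text-to-speech encoding ratio. The function name, the ratio, and the placeholder speech units are invented for illustration and are not Qwen3.5-Omni's actual mechanism.

```python
from typing import Iterable, Iterator, Tuple

def interleave_text_and_speech(
    text_tokens: Iterable[str],
    speech_units_per_text_token: float = 3.0,  # assumed average codec units per text token
) -> Iterator[Tuple[str, str]]:
    """Pace speech-unit emission against a streamed text-token sequence.

    Hypothetical sketch: for each incoming text token, emit a proportional
    budget of speech units so the speech stream never drifts far from the
    text it should be voicing. A real Talker would emit codec tokens here;
    we emit placeholder strings instead.
    """
    debt = 0.0  # speech units owed but not yet emitted
    for tok in text_tokens:
        yield ("text", tok)
        debt += speech_units_per_text_token
        while debt >= 1.0:
            yield ("speech", f"<unit voicing {tok!r}>")
            debt -= 1.0

if __name__ == "__main__":
    for kind, item in interleave_text_and_speech(["Hello", ",", "world", "!"]):
        print(kind, item)
```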

Abstract

In this work, we present Qwen3.5-Omni, the latest advancement in the Qwen-Omni model family. Representing a significant evolution over its predecessor, Qwen3.5-Omni scales to hundreds of billions of parameters and supports a 256k context length. By leveraging a massive dataset comprising heterogeneous text-vision pairs and over 100 million hours of audio-visual content, the model demonstrates robust omni-modality capabilities. Qwen3.5-Omni-plus achieves SOTA results across 215 audio and audio-visual understanding, reasoning, and interaction subtasks and benchmarks, surpassing Gemini-3.1 Pro in key audio tasks and matching it in comprehensive audio-visual understanding. Architecturally, Qwen3.5-Omni employs a Hybrid Attention Mixture-of-Experts (MoE) framework for both Thinker and Talker, enabling efficient long-sequence inference. The model facilitates sophisticated interaction, supporting understanding of over 10 hours of audio and up to 400 seconds of 720P video at 1 FPS. To address the inherent instability and unnaturalness in streaming speech synthesis, often caused by encoding efficiency discrepancies between text and speech tokenizers, we introduce ARIA. ARIA dynamically aligns text and speech units, significantly enhancing the stability and prosody of conversational speech with minimal latency impact. Furthermore, Qwen3.5-Omni expands linguistic boundaries, supporting multilingual understanding and speech generation across 10 languages with human-like emotional nuance. Finally, Qwen3.5-Omni exhibits superior audio-visual grounding capabilities, generating script-level structured captions with precise temporal synchronization and automated scene segmentation. Remarkably, we observed the emergence of a new capability in omni-modal models: directly writing code based on audio-visual instructions, which we call Audio-Visual Vibe Coding.
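
As a rough sanity check on the long-video claim, the arithmetic below shows how 400 seconds of 720P video sampled at 1 FPS could fit within a 256k-token context. The per-frame visual-token budget is an assumption made for illustration; the report excerpt above does not state the model's actual tokenizer cost per frame.

```python
# Back-of-the-envelope budget, using the figures reported above plus one
# assumption: the number of visual tokens a 720P frame costs is NOT given
# in the report excerpt, so 400 tokens/frame below is purely hypothetical.

CONTEXT_LENGTH = 256_000           # tokens (reported context length)
VIDEO_SECONDS = 400                # reported maximum video duration
FPS = 1                            # reported sampling rate
ASSUMED_TOKENS_PER_FRAME = 400     # hypothetical visual-token cost per frame

frames = VIDEO_SECONDS * FPS                       # 400 frames
video_tokens = frames * ASSUMED_TOKENS_PER_FRAME   # 160,000 tokens under this assumption
remaining = CONTEXT_LENGTH - video_tokens          # ~96,000 tokens left over

print(f"{frames} frames -> ~{video_tokens:,} visual tokens; "
      f"~{remaining:,} tokens remain for text, audio, and generated output")
```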