Qwen3.5-Omni Technical Report

arXiv cs.CL / 2026/4/20

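Key Points

Qwen3.5-Omni is presented as a major evolution of the Qwen-Omni family, scaling to hundreds of billions of parameters and a 256k context length to strengthen omni-modal performance.
The model is trained on large-scale multimodal data comprising heterogeneous text-vision pairs and over 100 million hours of audio-visual content, and Qwen3.5-Omni-plus achieves state-of-the-art (SOTA) results across 215 audio and audio-visual understanding, reasoning, and interaction subtasks and benchmarks.
Both the Thinker and the Talker use a Hybrid Attention Mixture-of-Experts (MoE) framework, which makes long-sequence inference efficient and enables extended interaction (e.g., over 10 hours of audio understanding, and up to 400 seconds of 720P video at 1 FPS).
To address the instability and unnaturalness of streaming speech synthesis, the authors propose ARIA, which dynamically aligns text and speech units and improves stability and prosody with minimal impact on latency.
The report also describes multilingual speech generation in 10 languages with emotional nuance, strong audio-visual grounding through temporally synchronized structured captions and scene segmentation, and a newly observed capability of coding directly from audio-visual instructions (Audio-Visual Vibe Coding).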

Abstract

In this work, we present Qwen3.5-Omni, the latest advancement in the Qwen-Omni model family. Representing a significant evolution over its predecessor, Qwen3.5-Omni scales to hundreds of billions of parameters and supports a 256k context length. By leveraging a massive dataset comprising heterogeneous text-vision pairs and over 100 million hours of audio-visual content, the model demonstrates robust omni-modality capabilities. Qwen3.5-Omni-plus achieves SOTA results across 215 audio and audio-visual understanding, reasoning, and interaction subtasks and benchmarks, surpassing Gemini-3.1 Pro in key audio tasks and matching it in comprehensive audio-visual understanding. Architecturally, Qwen3.5-Omni employs a Hybrid Attention Mixture-of-Experts (MoE) framework for both Thinker and Talker, enabling efficient long-sequence inference. The model facilitates sophisticated interaction, supporting over 10 hours of audio understanding and 400 seconds of 720P video (at 1 FPS). To address the inherent instability and unnaturalness in streaming speech synthesis, often caused by encoding efficiency discrepancies between text and speech tokenizers, we introduce ARIA. ARIA dynamically aligns text and speech units, significantly enhancing the stability and prosody of conversational speech with minimal latency impact. Furthermore, Qwen3.5-Omni expands linguistic boundaries, supporting multilingual understanding and speech generation across 10 languages with human-like emotional nuance. Finally, Qwen3.5-Omni exhibits superior audio-visual grounding capabilities, generating script-level structured captions with precise temporal synchronization and automated scene segmentation. Remarkably, we observed the emergence of a new capability in omnimodal models: directly performing coding based on audio-visual instructions, which we call Audio-Visual Vibe Coding.
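The long-input figures quoted in the abstract can be put side by side with the 256k context length using simple arithmetic. The sketch below is only a back-of-envelope ceiling, not a description of how the model encodes audio or video. It assumes that "256k" means 256 × 1024 tokens and that a whole input must fit in a single context window with nothing else in it; neither assumption is stated in the source, and all names in the code are hypothetical.

```python
# Back-of-envelope budget check for the long-input figures quoted in the abstract.
# Assumptions (NOT stated in the source): "256k" means 256 * 1024 tokens, and the
# entire input has to fit inside one context window with nothing else in it.

CONTEXT_TOKENS = 256 * 1024      # assumed reading of "256k"
AUDIO_SECONDS = 10 * 60 * 60     # "over 10 hours" of audio, taken at its lower bound
VIDEO_SECONDS = 400              # "400 seconds of 720P video"
VIDEO_FPS = 1                    # "(at 1 FPS)"


def max_tokens_per_unit(context_tokens: int, units: int) -> float:
    """Upper bound on tokens per unit if `units` items must share one window."""
    if units <= 0:
        raise ValueError("units must be positive")
    return context_tokens / units


video_frames = VIDEO_SECONDS * VIDEO_FPS
audio_budget = max_tokens_per_unit(CONTEXT_TOKENS, AUDIO_SECONDS)
frame_budget = max_tokens_per_unit(CONTEXT_TOKENS, video_frames)

print(f"video frames sampled: {video_frames}")
print(f"audio: at most {audio_budget:.2f} tokens per second of audio")
print(f"video: at most {frame_budget:.2f} tokens per frame")
```

Under these assumptions the ceilings come out to roughly 7 tokens per second of audio and about 655 tokens per sampled frame. The actual per-second and per-frame token counts are not given in the abstract and may well differ.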
