Qwen3.5-Omni learned to write code from spoken instructions and video without anyone training it to

THE DECODER / 3/31/2026


Key Points

  • Alibaba has released Qwen3.5-Omni, an omnimodal AI model that can process text, images, audio, and video in a single system.
  • Alibaba positions the model as outperforming Gemini 3.1 Pro on audio-related tasks.
  • A notable capability is that Qwen3.5-Omni can generate code from spoken instructions and video inputs, even though no one explicitly trained it for code-writing from those modalities.
  • The release highlights a broader trend of multimodal models exhibiting emergent abilities across modalities without narrowly targeted training.

Image: Alibaba's promotional graphic shows two teddy bears in traditional Chinese clothing. The bear at a desk represents Qwen3.5-Omni-Plus (SOTA performance, detailed audio-visual captioning, native multimodal, extensive multilingual support); the bear holding a smartphone represents Qwen3.5-Omni-Plus-Realtime (voice control, WebSearch tool, voice clone, semantic interruption).

Alibaba has released Qwen3.5-Omni, an omnimodal AI model that processes text, images, audio, and video in a single system. Alibaba claims it beats Gemini 3.1 Pro on audio tasks, and the model picked up an unexpected trick along the way: writing code from spoken instructions and video input, despite never being explicitly trained to do so.
