OmniEncoder: See, Hear, and Feel Continuous Motion Like Humans With One Encoder

arXiv cs.CV / 5/5/2026


Key Points

  • The OmniEncoder paper argues that existing omni-modal LLM architectures use mismatched sampling rates (video at 1–2 fps vs audio at ~25 fps), causing models to process modalities in a fragmented, frame-by-frame way rather than holistically like humans.
  • OmniEncoder proposes a unified Transformer backbone that co-embeds visual and audio signals at a symmetrical 25 fps into a shared latent space to improve cross-modal interaction and capture fine-grained visual motion (a sketch of this matched-rate interleaving follows the list).
  • The method introduces three components—Omni-Encoder Token Template, Omni-RoPE, and Temporal Window Shifting—to address modality disentanglement while keeping computational efficiency manageable.
  • Experiments report substantial improvements over a modality-specific baseline (Qwen2.5-Omni) under the same input token budget for continuous visual understanding tasks such as sign language recognition and fine-grained sports action analysis.
  • OmniEncoder also maintains competitive results on established audio-visual benchmarks (AVQA and speaker identification/localization), suggesting the unified encoding approach is broadly effective.
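To make the symmetric-rate idea concrete, here is a minimal PyTorch sketch of interleaving video and audio tokens at a matched 25 fps under one shared temporal encoding. It is an illustration under stated assumptions, not the paper's implementation: the function names are made up, and the additive sinusoidal encoding merely stands in for Omni-RoPE (which, like standard RoPE, would rotate query/key vectors inside attention); the exact Omni-Encoder Token Template layout is not specified in this summary.

```python
import torch

FPS = 25  # both streams tokenized at the same 25 Hz rate

def shared_time_encoding(t: torch.Tensor, dim: int) -> torch.Tensor:
    """Sinusoidal features indexed by the wall-clock frame index t.
    A stand-in for Omni-RoPE: true RoPE rotates query/key vectors, but
    an additive encoding suffices to illustrate the shared time axis."""
    inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2).float() / dim))
    ang = t[:, None].float() * inv_freq[None, :]      # (T, dim/2)
    return torch.cat([ang.sin(), ang.cos()], dim=-1)  # (T, dim)

def interleave_av_tokens(video: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
    """Interleave per-frame visual and audio embeddings, sampled at the
    same rate, into one sequence for a single shared Transformer.
    video, audio: (T, D), one embedding per 1/25 s frame."""
    T, D = video.shape
    pos = shared_time_encoding(torch.arange(T), D)
    # The same temporal position is attached to both modalities at each
    # frame, so attention can align events by time rather than by index.
    pairs = torch.stack([video + pos, audio + pos], dim=1)  # (T, 2, D)
    return pairs.reshape(2 * T, D)                          # v0, a0, v1, a1, ...

# Two seconds of synchronized 25 fps streams with 64-dim embeddings:
seq = interleave_av_tokens(torch.randn(2 * FPS, 64), torch.randn(2 * FPS, 64))
print(seq.shape)  # torch.Size([100, 64])
```

The point of the interleaving is that a video token and an audio token from the same instant share one temporal position, so cross-modal attention inside the encoder can align events by time; with modality-specific encoders at 1–2 fps vs 25 fps, no such frame-level correspondence exists.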

Abstract

Recent advances in omni-modal large language models have enabled remarkable progress in joint vision-audio understanding. However, prevailing architectures rely on modality-specific encoders with a "video-coarse, audio-dense" design, sampling visual frames at 1–2 fps while processing audio waveforms at 25 fps, so the resulting systems perceive video frame by frame and modality by modality rather than holistically, as humans do. This discrepancy leaves models with impoverished cross-modal interaction during encoding and an inability to capture fine-grained visual motion. To bridge this gap, we present Omni-Encoder, a unified Transformer backbone designed to co-embed visual and audio signals at a symmetrical 25 fps within a shared latent space. The architecture leverages three core innovations (the Omni-Encoder Token Template, Omni-RoPE, and Temporal Window Shifting) to reconcile the dual challenges of modality disentanglement and computational efficiency. Experiments demonstrate that, compared to the modality-specific baseline Qwen2.5-Omni under the same input token budget to the LLM decoder, Omni-Encoder delivers substantial gains on continuous visual understanding tasks such as sign language recognition and fine-grained sports action analysis, while maintaining competitive performance on established audio-visual benchmarks such as AVQA and Speaker Identification and Localization. These results suggest that unified omnivorous encoding offers a promising direction for building omni-modal models that more closely reflect the integrated nature of human perception.
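The abstract names Temporal Window Shifting as the efficiency mechanism but does not spell it out; encoding both modalities at 25 fps makes sequences far longer than at 1–2 fps, so full attention over all tokens would be costly. One natural reading, sketched below purely as an assumption, is Swin-style attention windows along the time axis whose boundaries shift on alternating layers: each layer then costs O(T · window) rather than O(T²), and the shifted layers let information cross window borders. The function name, window sizes, and padding behavior here are hypothetical.

```python
import torch
import torch.nn.functional as F

def window_attention(x: torch.Tensor, window: int, shift: int) -> torch.Tensor:
    """Self-attention restricted to fixed-size temporal windows.
    x: (T, D). Alternating shift=0 and shift=window//2 across layers
    lets information propagate past window boundaries at linear cost.
    Padding/masking details are omitted for brevity."""
    T, D = x.shape
    x = torch.roll(x, shifts=-shift, dims=0)       # shift window boundaries
    pad = (-T) % window
    x = F.pad(x, (0, 0, 0, pad))                   # pad T to a multiple of window
    w = x.reshape(-1, window, D)                   # (num_windows, window, D)
    attn = torch.softmax(w @ w.transpose(1, 2) / D ** 0.5, dim=-1)
    out = (attn @ w).reshape(-1, D)[:T]            # attend within windows, unpad
    return torch.roll(out, shifts=shift, dims=0)   # undo the shift

x = torch.randn(100, 64)                           # e.g. 2 s of 25 fps A/V tokens
y0 = window_attention(x, window=10, shift=0)       # layer 1: aligned windows
y1 = window_attention(y0, window=10, shift=5)      # layer 2: shifted windows
print(y1.shape)  # torch.Size([100, 64])
```

Whatever the paper's exact variant, the design pressure is the same: symmetric 25 fps encoding is only practical if attention over the lengthened token sequence stays near-linear in duration.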