CoLoRSMamba: Conditional LoRA-Steered Mamba for Supervised Multimodal Violence Detection

arXiv cs.CV / 4/7/2026


Key Points

  • The paper introduces CoLoRSMamba, a directional video-to-audio multimodal architecture that links a VideoMamba encoder with an AudioMamba module using CLS-guided conditional LoRA for scene-aware audio modeling.
  • Instead of token-level cross-attention, the VideoMamba CLS token generates channel-wise modulation and a stabilization gate to adapt AudioMamba’s selective state-space parameters (including the step-size pathway).
  • Training uses a combination of binary violence classification and a symmetric AV-InfoNCE contrastive objective to align clip-level audio and video embeddings.
  • For fair evaluation under real-world conditions, the authors curate audio-filtered clip-level subsets of NTU-CCTV and DVD based on temporal annotations, keeping only clips where audio is available.
  • On these subsets, CoLoRSMamba reports improved results (88.63% accuracy / 86.24% F1-V on NTU-CCTV; 75.77% accuracy / 72.94% F1-V on DVD) and claims a strong accuracy-efficiency tradeoff versus larger baselines.
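To make the conditioning mechanism in the second point concrete, here is a minimal pure-Python sketch of a CLS-conditioned LoRA update applied to one of AudioMamba's state-space projections. All names (`conditional_lora`, `Ws`, `wg`) and the specific form of the gate and modulation are illustrative assumptions, not the paper's exact implementation: the video CLS token produces a channel-wise scale and a scalar gate that modulate a low-rank correction added to a frozen projection.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def conditional_lora(x, cls, W, A, B, Ws, wg):
    """Hypothetical CLS-conditioned LoRA update for one projection.

    x   : audio token features (input to a Delta/B/C projection)
    cls : VideoMamba CLS token
    W   : frozen base projection matrix
    A,B : low-rank LoRA factors (B @ A has the shape of W)
    Ws  : maps cls -> channel-wise modulation vector s (assumed linear)
    wg  : maps cls -> scalar stabilization gate g in (0, 1)
    """
    base = matvec(W, x)                 # frozen audio pathway
    low_rank = matvec(B, matvec(A, x))  # LoRA pathway
    s = matvec(Ws, cls)                 # channel-wise modulation from video
    g = sigmoid(sum(w * c for w, c in zip(wg, cls)))  # stabilization gate
    # gated, channel-wise modulated low-rank correction on top of the base
    return [b + g * si * lr for b, si, lr in zip(base, s, low_rank)]
```

Because the video signal enters only through `s` and `g`, the audio branch is steered per channel rather than per token, which is how the architecture avoids token-level cross-attention.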

Abstract

Violence detection benefits from audio, but real-world soundscapes can be noisy or weakly related to the visible scene. We present CoLoRSMamba, a directional video-to-audio multimodal architecture that couples VideoMamba and AudioMamba through CLS-guided conditional LoRA. At each layer, the VideoMamba CLS token produces a channel-wise modulation vector and a stabilization gate that adapt the AudioMamba projections responsible for the selective state-space parameters (Delta, B, C), including the step-size pathway, yielding scene-aware audio dynamics without token-level cross-attention. Training combines binary classification with a symmetric AV-InfoNCE objective that aligns clip-level audio and video embeddings. To support fair multimodal evaluation, we curate audio-filtered clip-level subsets of the NTU-CCTV and DVD datasets from temporal annotations, retaining only clips with available audio. On these subsets, CoLoRSMamba outperforms representative audio-only, video-only, and multimodal baselines, achieving 88.63% accuracy / 86.24% F1-V on NTU-CCTV and 75.77% accuracy / 72.94% F1-V on DVD. It further offers a favorable accuracy-efficiency tradeoff, surpassing several larger models with fewer parameters and FLOPs.
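The symmetric AV-InfoNCE objective mentioned in the abstract can be sketched as a standard two-direction contrastive loss over clip-level embeddings: each matched (audio, video) pair is a positive, all other clips in the batch are negatives, and the loss averages the audio-to-video and video-to-audio cross-entropies. This is a minimal pure-Python sketch of that generic formulation; the function name and the temperature value `tau=0.07` are assumptions, not values from the paper.

```python
import math

def l2_normalize(v):
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def symmetric_av_infonce(audio, video, tau=0.07):
    """Symmetric clip-level InfoNCE over matched (audio_i, video_i) pairs.

    audio, video : lists of clip-level embedding vectors, index-aligned
    tau          : temperature (assumed value, not from the paper)
    """
    A = [l2_normalize(a) for a in audio]
    V = [l2_normalize(v) for v in video]
    n = len(A)
    # cosine-similarity logits scaled by temperature
    S = [[dot(A[i], V[j]) / tau for j in range(n)] for i in range(n)]

    def ce(rows):  # cross-entropy with the diagonal as the target
        loss = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract max for numerical stability
            logz = m + math.log(sum(math.exp(r - m) for r in row))
            loss += logz - row[i]
        return loss / n

    S_t = [list(col) for col in zip(*S)]  # video -> audio direction
    return 0.5 * (ce(S) + ce(S_t))
```

When the matched audio and video embeddings already coincide, the loss is near zero; mismatched pairings drive it up, which is what pulls clip-level audio and video representations into a shared space.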