CoLoRSMamba: Conditional LoRA-Steered Mamba for Supervised Multimodal Violence Detection
arXiv cs.CV / 4/7/2026
Key Points
- The paper introduces CoLoRSMamba, a directional video-to-audio multimodal architecture that links a VideoMamba encoder with an AudioMamba module using CLS-guided conditional LoRA for scene-aware audio modeling.
- Instead of token-level cross-attention, the VideoMamba CLS token generates channel-wise modulation and a stabilization gate to adapt AudioMamba’s selective state-space parameters (including the step-size pathway).
- Training uses a combination of binary violence classification and a symmetric AV-InfoNCE contrastive objective to align clip-level audio and video embeddings.
- For fair evaluation under real-world conditions, the authors curate audio-filtered clip-level subsets of NTU-CCTV and DVD based on temporal annotations, keeping only clips where audio is available.
- On these subsets, CoLoRSMamba reports improved results (88.63% accuracy / 86.24% F1-V on NTU-CCTV; 75.77% accuracy / 72.94% F1-V on DVD) and claims a strong accuracy-efficiency tradeoff versus larger baselines.
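The CLS-guided conditioning described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the dimensions, the low-rank (LoRA-style) projection, the sigmoid stabilization gate, and the softplus on the step-size pathway are all assumptions about one plausible parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)

d_video, d_audio, rank = 192, 128, 8  # hypothetical embedding sizes and LoRA rank
T = 50                                # number of audio tokens in a clip

# Hypothetical conditioning weights (the paper's exact parameterization may differ).
W_down = rng.normal(0, 0.02, (d_video, rank))   # LoRA down-projection
W_up   = rng.normal(0, 0.02, (rank, d_audio))   # LoRA up-projection
W_gate = rng.normal(0, 0.02, (d_video, d_audio))

def condition_audio_step(cls_video, delta_audio):
    """Adapt AudioMamba's per-channel step-size logits with the VideoMamba CLS token.

    cls_video:   (d_video,)   clip-level video CLS embedding
    delta_audio: (T, d_audio) unconditioned step-size logits
    """
    # Channel-wise modulation generated from the CLS token via a low-rank map.
    mod = cls_video @ W_down @ W_up                     # (d_audio,)
    # Stabilization gate in (0, 1) bounds how strongly video steers audio.
    gate = 1.0 / (1.0 + np.exp(-(cls_video @ W_gate)))  # (d_audio,)
    # Gated additive shift; softplus keeps the resulting step size positive.
    shifted = delta_audio + gate * mod
    return np.log1p(np.exp(shifted))                    # (T, d_audio), all > 0

cls = rng.normal(size=d_video)
delta = rng.normal(size=(T, d_audio))
delta_cond = condition_audio_step(cls, delta)
print(delta_cond.shape)        # (50, 128)
print((delta_cond > 0).all())  # True
```

The same gate-and-modulate pattern could condition the other selective state-space parameters; the step-size (Δ) pathway is shown because the summary singles it out.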
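The symmetric AV-InfoNCE objective pairs each clip's audio and video embeddings against in-batch negatives in both directions. A minimal numpy sketch, assuming cosine similarity and a temperature of 0.07 (the paper's temperature and normalization details are not given in this summary):

```python
import numpy as np

def av_infonce(a, v, tau=0.07):
    """Symmetric audio-video InfoNCE over clip-level embeddings.

    a, v: (N, D) embeddings; row i of `a` and row i of `v` come from the same clip.
    """
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    logits = a @ v.T / tau                  # (N, N); matched pairs on the diagonal
    idx = np.arange(len(a))

    def ce(l):
        # Cross-entropy with the diagonal as the positive class for each row.
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    # Average the audio-to-video and video-to-audio directions.
    return 0.5 * (ce(logits) + ce(logits.T))

rng = np.random.default_rng(1)
a = rng.normal(size=(8, 16))
loss_rand = av_infonce(a, rng.normal(size=(8, 16)))  # unaligned pairs: high loss
loss_match = av_infonce(a, a)                        # perfectly aligned: near zero
print(loss_match < loss_rand)                        # True
```

Averaging both directions is what makes the objective symmetric: neither modality is treated as the fixed anchor.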