A Heterogeneous Two-Stream Framework for Video Action Recognition with Comparative Fusion Analysis
arXiv cs.CV · April 28, 2026
Key Points
- The paper argues that standard two-stream action recognition models use identical backbones for both RGB and optical flow despite the modalities' structural differences, which can waste modality-specific information.
- It proposes DualStreamHybrid, which pairs a pretrained ViT-Tiny/16 backbone for RGB with a scratch-trained MobileNetV2 for a 20-channel stacked optical-flow representation, using learned projections to map both streams to a shared feature size before fusion.
- The authors evaluate five fusion strategies—late fusion, concatenation, cross-attention, weighted fusion, and gated fusion—within a unified framework and analyze how fusion behavior changes with dataset size.
- On UCF11, cross-attention delivers 98.12% test accuracy, beating an RGB-only ViT-Tiny baseline (95.94%), while on UCF50 weighted fusion performs best and is the most consistent across benchmarks.
- Learned fusion weights show that modality contributions are nearly balanced on UCF11 (RGB 0.507 vs flow 0.493) but shift toward RGB on UCF50 (RGB 0.554 vs flow 0.446), suggesting dataset complexity influences the best fusion approach.
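The weighted-fusion variant described above can be sketched in a few lines: each stream is projected to a shared feature size, then mixed with softmax-normalized learned scalar weights (the kind of weights reported as 0.507/0.493 on UCF11). The dimensions and variable names below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative feature sizes: ViT-Tiny/16 commonly emits 192-dim features and
# MobileNetV2 1280-dim pooled features; the shared size 256 is an assumption.
D_RGB, D_FLOW, D_SHARED = 192, 1280, 256

# Learned projection matrices (randomly initialized for this sketch).
W_rgb = rng.standard_normal((D_RGB, D_SHARED)) * 0.02
W_flow = rng.standard_normal((D_FLOW, D_SHARED)) * 0.02

def weighted_fusion(f_rgb, f_flow, logits):
    """Project both streams to a shared size, then combine them with
    softmax-normalized learned scalar weights."""
    p_rgb = f_rgb @ W_rgb            # (batch, D_SHARED)
    p_flow = f_flow @ W_flow         # (batch, D_SHARED)
    w = np.exp(logits - logits.max())
    w = w / w.sum()                  # modality weights sum to 1
    return w[0] * p_rgb + w[1] * p_flow, w

f_rgb = rng.standard_normal((4, D_RGB))    # a batch of RGB-stream features
f_flow = rng.standard_normal((4, D_FLOW))  # matching flow-stream features
fused, w = weighted_fusion(f_rgb, f_flow, np.array([0.03, 0.0]))
print(fused.shape, w.round(3))
```

At inference time the learned weights are fixed, so inspecting them (as the authors do) directly reveals each modality's relative contribution.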
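Cross-attention fusion, the best performer on UCF11, lets one stream query the other rather than mixing pooled vectors. A minimal single-head sketch, assuming RGB tokens attend over flow tokens at a shared feature size (all matrices and dimensions here are illustrative stand-ins for learned parameters, not the paper's exact design):

```python
import numpy as np

rng = np.random.default_rng(1)
D = 256  # shared feature size after projection; an assumption

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_rgb, kv_flow):
    """Single-head cross-attention: RGB features query the flow stream.
    Wq/Wk/Wv are randomly initialized stand-ins for learned projections."""
    Wq, Wk, Wv = (rng.standard_normal((D, D)) * 0.02 for _ in range(3))
    Q = q_rgb @ Wq                                   # (n_rgb, D)
    K = kv_flow @ Wk                                 # (n_flow, D)
    V = kv_flow @ Wv                                 # (n_flow, D)
    attn = softmax(Q @ K.T / np.sqrt(D), axis=-1)    # (n_rgb, n_flow)
    return q_rgb + attn @ V  # residual add, as in standard transformer blocks

rgb_tokens = rng.standard_normal((8, D))   # e.g. 8 RGB-stream tokens
flow_tokens = rng.standard_normal((5, D))  # e.g. 5 flow-stream tokens
out = cross_attention(rgb_tokens, flow_tokens)
print(out.shape)
```

Because each RGB token forms its own attention distribution over flow tokens, this coupling is more expressive than a single scalar weight per modality, which may explain why it helps most on the smaller UCF11 benchmark.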