A Heterogeneous Two-Stream Framework for Video Action Recognition with Comparative Fusion Analysis

arXiv cs.CV / 4/28/2026

Key Points

  • The paper argues that standard two-stream action recognition models use identical backbones for RGB and optical flow despite their structural differences, which can waste modality-specific information.
  • It proposes DualStreamHybrid, which pairs a pretrained ViT-Tiny/16 backbone for RGB frames with a MobileNetV2 trained from scratch on a 20-channel stacked optical-flow input, and uses a learned projection to bring both streams to a shared feature size before fusion (a minimal sketch follows this list).
  • The authors evaluate five fusion strategies—late fusion, concatenation, cross-attention, weighted fusion, and gated fusion—within a unified framework and analyze how fusion behavior changes with dataset size.
  • On UCF11, cross-attention delivers 98.12% test accuracy, beating an RGB-only ViT-Tiny baseline (95.94%); on UCF50, weighted fusion performs best (96.86%) and is the most consistent strategy across both benchmarks.
  • Learned fusion weights show that modality contributions are nearly balanced on UCF11 (RGB 0.507 vs flow 0.493) but shift toward RGB on UCF50 (RGB 0.554 vs flow 0.446), suggesting dataset complexity influences the best fusion approach.
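
Below is a minimal PyTorch sketch of the heterogeneous backbone pairing, not the authors' code: the class name DualStreamHybridSketch, the timm model string, the single-frame RGB input, and the shared dimension of 256 are assumptions for illustration. The paper itself only specifies a pretrained ViT-Tiny/16 for RGB, a scratch-trained MobileNetV2 on 20-channel stacked optical flow, and a learned projection to a common feature size before fusion.

```python
# Sketch only: backbone choices follow the paper, but feature sizes, the shared
# dimension, and the single-frame RGB input are assumptions, not the authors' code.
import torch
import torch.nn as nn
import timm
from torchvision.models import mobilenet_v2

class DualStreamHybridSketch(nn.Module):
    def __init__(self, num_classes: int, d_model: int = 256):
        super().__init__()
        # Appearance stream: pretrained ViT-Tiny/16 returning a 192-dim pooled feature.
        self.rgb_backbone = timm.create_model(
            "vit_tiny_patch16_224", pretrained=True, num_classes=0)
        # Motion stream: MobileNetV2 from scratch; the stem is widened to accept the
        # 20-channel stacked flow input (10 frames x horizontal/vertical components).
        flow_net = mobilenet_v2(weights=None)
        flow_net.features[0][0] = nn.Conv2d(20, 32, kernel_size=3, stride=2,
                                            padding=1, bias=False)
        flow_net.classifier = nn.Identity()  # expose the 1280-dim pooled feature
        self.flow_backbone = flow_net
        # Learned projections map the differently sized features into one shared space.
        self.rgb_proj = nn.Linear(192, d_model)
        self.flow_proj = nn.Linear(1280, d_model)
        # Simplest possible head (concatenation fusion); other fusion heads are
        # sketched after the abstract below.
        self.classifier = nn.Linear(2 * d_model, num_classes)

    def forward(self, rgb: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
        z_rgb = self.rgb_proj(self.rgb_backbone(rgb))       # (B, d_model)
        z_flow = self.flow_proj(self.flow_backbone(flow))   # (B, d_model)
        return self.classifier(torch.cat([z_rgb, z_flow], dim=-1))

model = DualStreamHybridSketch(num_classes=11)  # e.g. UCF11
logits = model(torch.randn(2, 3, 224, 224), torch.randn(2, 20, 224, 224))
```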

Abstract

Most two-stream action recognition networks apply the same convolutional backbone to both the RGB and optical-flow streams, ignoring the fact that the two modalities have fundamentally different structural properties. Optical flow captures fine-grained motion patterns, while RGB frames carry rich appearance and scene context; treating them identically discards this distinction. We propose DualStreamHybrid, a heterogeneous two-stream architecture that assigns each stream a backbone suited to its input: a pretrained ViT-Tiny/16 for RGB frames and a MobileNetV2 trained from scratch on a 20-channel stacked optical-flow representation. A learned projection layer maps the two differently sized feature vectors to a common dimensionality before fusion, enabling the streams to interact without forcing architectural symmetry. We design five fusion strategies within a unified framework (late fusion, concatenation, cross-attention, weighted fusion, and gated fusion) and evaluate them on UCF11 (1,600 videos, 11 classes) and UCF50 (6,681 videos, 50 classes) to study how fusion behaviour scales with dataset size. On UCF11, cross-attention achieves 98.12% test accuracy, outperforming the RGB-only ViT-Tiny baseline of 95.94%, which suggests that explicit inter-modal attention is particularly effective on smaller, less complex datasets. On UCF50, weighted fusion reaches 96.86% and proves the most consistent strategy across both benchmarks. The learned stream weights reveal an interesting pattern: UCF11 sees near-equal modality contributions (RGB: 0.507, flow: 0.493), while UCF50 favours the RGB stream slightly more (RGB: 0.554, flow: 0.446), arguably reflecting its larger and more visually diverse action space. Taken together, these results suggest that even a lightweight motion stream meaningfully complements a strong appearance encoder, and that the optimal fusion strategy depends on dataset scale.
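
Two of the five fusion strategies can be sketched as small PyTorch heads operating on the projected per-stream features z_rgb and z_flow from the sketch above. The exact formulations, hyperparameters, and module names (WeightedFusion, CrossAttentionFusion) are assumptions; only the general mechanisms, a learned pair of stream weights and one stream attending to the other, are described in the abstract.

```python
# Hedged sketches of two of the five fusion heads; the paper's exact formulations
# may differ. Inputs are the projected features z_rgb, z_flow of shape (B, d).
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    # Learns one scalar weight per stream (softmax-normalised), matching the kind of
    # RGB-vs-flow split reported in the paper (e.g. 0.507 vs 0.493 on UCF11).
    def __init__(self, d: int, num_classes: int):
        super().__init__()
        self.stream_logits = nn.Parameter(torch.zeros(2))
        self.classifier = nn.Linear(d, num_classes)

    def forward(self, z_rgb: torch.Tensor, z_flow: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.stream_logits, dim=0)   # two weights summing to 1
        fused = w[0] * z_rgb + w[1] * z_flow
        return self.classifier(fused)

class CrossAttentionFusion(nn.Module):
    # RGB features query the flow features through multi-head attention; the attended
    # result is concatenated with the flow features before classification.
    def __init__(self, d: int, num_classes: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, num_heads, batch_first=True)
        self.classifier = nn.Linear(2 * d, num_classes)

    def forward(self, z_rgb: torch.Tensor, z_flow: torch.Tensor) -> torch.Tensor:
        q = z_rgb.unsqueeze(1)                         # (B, 1, d) query from RGB
        kv = z_flow.unsqueeze(1)                       # (B, 1, d) keys/values from flow
        attended, _ = self.attn(q, kv, kv)
        fused = torch.cat([attended.squeeze(1), z_flow], dim=-1)
        return self.classifier(fused)

head = WeightedFusion(d=256, num_classes=50)           # e.g. UCF50
logits = head(torch.randn(4, 256), torch.randn(4, 256))
```

A gated variant would presumably replace the two scalar weights with a feature-wise sigmoid gate computed from both streams, and late fusion would instead combine the two per-stream prediction scores; neither detail is specified in the abstract.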