A Two-stage Transformer Framework for Temporal Localization of Distracted Driver Behaviors

arXiv cs.CV / 3/24/2026


Key Points

  • The paper proposes a two-stage transformer-based temporal action localization framework for detecting hazardous distracted-driver behaviors from in-cabin video streams, aimed at periodic inspection settings like checkpoints and fleet safety systems.
  • It uses VideoMAE-based feature extraction followed by an AMA (Augmented Self-Mask Attention) detector, with an SPPF (Spatial Pyramid Pooling-Fast) module to improve multi-scale temporal feature capture.
  • Experiments show a clear accuracy–efficiency trade-off: ViT-Giant achieves 88.09% Top-1 test accuracy, while a lighter ViT-based variant reaches 82.55% with far lower computational fine-tuning costs.
  • For the localization task, adding SPPF improves performance across configurations, with the ViT-Giant + SPPF model reaching 92.67% mAP and the lightweight ViT configuration maintaining strong results.
  • The results suggest that model capacity can be tuned depending on deployment constraints, enabling safer driver monitoring with controlled inference/compute requirements.
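The SPPF module mentioned above originates in object detection (YOLOv5) and is adapted here to 1D temporal feature sequences. A minimal hedged sketch, assuming a PyTorch implementation with `Conv1d`/`MaxPool1d` (the class name `SPPF1d` and its exact layer sizes are illustrative, not taken from the paper):

```python
import torch
import torch.nn as nn

class SPPF1d(nn.Module):
    """Spatial Pyramid Pooling-Fast, adapted to temporal (1D) features.

    Three *sequential* max-pools with the same kernel size k; concatenating
    the intermediate outputs is equivalent to parallel pooling at effective
    kernel sizes k, 2k-1, 3k-2, giving multi-scale temporal context at
    lower cost than running three independent pools.
    """

    def __init__(self, in_ch: int, out_ch: int, k: int = 5):
        super().__init__()
        hidden = in_ch // 2
        self.cv1 = nn.Conv1d(in_ch, hidden, kernel_size=1)
        # stride=1 with padding=k//2 keeps the temporal length unchanged
        self.pool = nn.MaxPool1d(kernel_size=k, stride=1, padding=k // 2)
        self.cv2 = nn.Conv1d(hidden * 4, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        x = self.cv1(x)
        y1 = self.pool(x)
        y2 = self.pool(y1)   # effective receptive field 2k-1
        y3 = self.pool(y2)   # effective receptive field 3k-2
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))
```

Because the pooling is stride-1 and length-preserving, the block can be dropped into a detector head without changing the temporal resolution of the feature sequence.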

Abstract

The identification of hazardous driving behaviors from in-cabin video streams is essential for enhancing road safety and supporting the detection of traffic violations and unsafe driver actions. However, current temporal action localization techniques often struggle to balance accuracy with computational efficiency. In this work, we develop and evaluate a temporal action localization framework tailored for driver monitoring scenarios, particularly suitable for periodic inspection settings such as transportation safety checkpoints or fleet management assessment systems. Our approach follows a two-stage pipeline that combines VideoMAE-based feature extraction with an Augmented Self-Mask Attention (AMA) detector, enhanced by a Spatial Pyramid Pooling-Fast (SPPF) module to capture multi-scale temporal features. Experimental results reveal a distinct trade-off between model capacity and efficiency. At the feature extraction stage, the ViT-Giant backbone delivers higher-quality representations, with 88.09% Top-1 test accuracy, while the ViT-based variant proves to be a practical alternative, achieving 82.55% accuracy at significantly lower computational fine-tuning cost (101.85 GFLOPs/segment compared to 1584.06 GFLOPs/segment for Giant). In the downstream localization task, the integration of SPPF consistently improves performance across all configurations. Notably, the ViT-Giant + SPPF model achieves a peak mAP of 92.67%, while the lightweight ViT-based configuration maintains robust results.
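The two-stage structure of the pipeline can be sketched abstractly: stage one runs the expensive backbone once per segment to produce a feature sequence, and stage two localizes actions over that sequence. The function names below (`extract_features`, `localize_actions`) are hypothetical placeholders, not the authors' API:

```python
from typing import Callable, List, Tuple
import numpy as np

# A proposal is (start_idx, end_idx, class_id, confidence score).
Proposal = Tuple[int, int, int, float]

def run_pipeline(
    segments: List[np.ndarray],
    extract_features: Callable[[np.ndarray], np.ndarray],   # e.g. a VideoMAE backbone
    localize_actions: Callable[[np.ndarray], List[Proposal]],  # e.g. an AMA detector
) -> List[Proposal]:
    # Stage 1: per-segment feature extraction -- the costly ViT forward pass,
    # where the ViT vs. ViT-Giant accuracy/GFLOPs trade-off arises.
    feats = np.stack([extract_features(s) for s in segments])  # shape (T, D)
    # Stage 2: temporal localization over the much cheaper feature sequence.
    return localize_actions(feats)
```

Decoupling the stages this way is what lets deployments tune the backbone capacity (ViT vs. ViT-Giant) against inference budget without retraining the localization head from raw video.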
