CollideNet: Hierarchical Multi-scale Video Representation Learning with Disentanglement for Time-To-Collision Forecasting
arXiv cs.CV / April 20, 2026
Key Points
- The paper introduces CollideNet, a hierarchical spatiotemporal transformer architecture designed specifically for time-to-collision (TTC) forecasting from video.
- CollideNet uses a spatial stream that aggregates per-frame information at multiple resolutions and a temporal stream that performs multi-scale feature encoding.
- The temporal modeling includes disentanglement of non-stationarity, trend, and seasonality components to better capture time-varying dynamics.
- The authors report state-of-the-art results on three public TTC-related datasets, with a sizable margin over prior methods, and provide code for reproducibility.
- Cross-dataset evaluations and visualizations are used to study generalization and the effect of the trend/seasonality disentanglement.
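
To make the trend/seasonality idea in the points above concrete, here is a minimal sketch of a classical moving-average decomposition applied to a 1-D feature sequence at several temporal scales. This is not the paper's implementation: the summary does not describe how CollideNet performs its disentanglement, so the function names (`moving_average`, `decompose`, `multi_scale_trends`), the kernel sizes, and the choice of a moving average are all illustrative assumptions. The non-stationarity component mentioned above is not modeled here.

```python
import numpy as np


def moving_average(x, kernel):
    """Centered moving average with edge padding (hypothetical helper).

    The output has the same length as the input.
    """
    pad_left = (kernel - 1) // 2
    pad_right = kernel - 1 - pad_left
    # Repeat the edge values so the window is always full.
    padded = np.pad(x, (pad_left, pad_right), mode="edge")
    window = np.ones(kernel) / kernel
    return np.convolve(padded, window, mode="valid")


def decompose(x, kernel=5):
    """Split a sequence into a smooth trend and a residual 'seasonal' part.

    Assumption: the trend is a moving average; the paper may use a
    different, learned decomposition.
    """
    trend = moving_average(x, kernel)
    seasonal = x - trend
    return trend, seasonal


def multi_scale_trends(x, kernels=(3, 7, 15)):
    """Compute trends at several temporal scales (kernel sizes are illustrative)."""
    return {k: moving_average(x, k) for k in kernels}


if __name__ == "__main__":
    t = np.arange(40, dtype=float)
    # Synthetic signal: slow linear drift plus a periodic component.
    signal = 0.1 * t + np.sin(2 * np.pi * t / 8)
    trend, seasonal = decompose(signal, kernel=9)
    print("trend shape:", trend.shape, "seasonal shape:", seasonal.shape)
    print("scales:", sorted(multi_scale_trends(signal)))
```

Larger kernels give smoother trends and push more of the signal into the residual, which is one simple way to look at the same sequence at multiple temporal scales.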



