Recall to Predict: Grounding Motion Forecasting in Interpretable Motion Bank

arXiv cs.CV / 5/5/2026


Key Points

  • Motion forecasting models often trade interpretability for predictive accuracy: standard anchor-based architectures rely either on opaque latent queries that are prone to latent collapse or on naive trajectory sampling that limits multi-modal diversity.
  • The proposed “Recall to Predict” framework grounds predictions in an interpretable Motion Bank: a structured embedding space of physically realizable trajectories learned via contrastive learning.
  • It introduces an Anchor Retrieval Layer that retrieves motion priors through dual-level gated cross-attention and uses a Straight-Through Gumbel-Softmax estimator to keep gradients flowing during discrete trajectory selection.
  • Retrieved motion primitives are further refined with a DETR-style decoder and trained jointly using a Winner-Takes-All kinematic Gaussian Mixture Model, diversity regularization, and a soft-min endpoint loss.
  • The authors report competitive multi-modal forecasting accuracy on the Argoverse 2 and Waymo Open Motion datasets, and the code is available on GitHub.
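
To make the discrete selection step in the key points more concrete, here is a minimal sketch of Gumbel-Softmax selection over a toy motion bank. It is not the authors' implementation: the bank contents, the logits, and the names `MOTION_BANK`, `gumbel_softmax_select`, and `retrieve` are all hypothetical. It uses plain Python, so it only shows the forward pass; the straight-through gradient path needs an autodiff framework and is described in a comment.

```python
import math
import random

# Hypothetical toy bank: three 3-step trajectories of (x, y) waypoints.
MOTION_BANK = [
    [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)],    # go straight
    [(1.0, 0.2), (1.8, 0.9), (2.2, 1.8)],    # turn left
    [(1.0, -0.2), (1.8, -0.9), (2.2, -1.8)], # turn right
]

def softmax(logits, tau=1.0):
    """Numerically stable softmax with temperature tau."""
    scaled = [x / tau for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def gumbel_softmax_select(logits, tau=1.0, rng=random):
    """Return (hard one-hot, soft probabilities) for one discrete choice."""
    noisy = []
    for x in logits:
        # Clamp U away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append(x - math.log(-math.log(u)))  # add Gumbel(0, 1) noise
    soft = softmax(noisy, tau)
    winner = max(range(len(soft)), key=soft.__getitem__)
    hard = [1.0 if i == winner else 0.0 for i in range(len(soft))]
    # Straight-through idea: the forward pass uses `hard`. In an autodiff
    # framework the output would be hard + soft - stop_gradient(soft), so the
    # backward pass follows the gradient of `soft`.
    return hard, soft

def retrieve(bank, weights):
    """Weighted sum of bank trajectories; a one-hot weight picks one entry."""
    steps = len(bank[0])
    return [
        (
            sum(w * traj[t][0] for w, traj in zip(weights, bank)),
            sum(w * traj[t][1] for w, traj in zip(weights, bank)),
        )
        for t in range(steps)
    ]

if __name__ == "__main__":
    hard, soft = gumbel_softmax_select([2.0, 0.5, -1.0], tau=0.5)
    print("soft weights:", soft)
    print("selected trajectory:", retrieve(MOTION_BANK, hard))
```

Writing retrieval as a weighted sum rather than an index lookup is a deliberate choice in this sketch: with a one-hot weight vector it returns exactly one bank entry, yet it remains a differentiable operation once the weights come from an autodiff framework.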

Abstract

Motion forecasting often requires trading interpretability for predictive accuracy. Standard anchor-based architectures rely on opaque latent queries that are highly prone to latent collapse, or naive trajectory sampling that limits multi-modal diversity. We propose an end-to-end differentiable framework that grounds predictions in a comprehensive "motion bank", a structured embedding space of physically realizable trajectories constructed via contrastive learning. Rather than regressing paths from a blank slate, our architecture dynamically retrieves explicit motion priors using a novel Anchor Retrieval Layer. This module adapts orthogonally initialized queries via a Dual-Level Gated Cross-Attention mechanism and executes discrete trajectory selection using a Straight-Through Gumbel-Softmax estimator to preserve continuous gradient flow. The retrieved semantically grounded anchors are then geometrically refined by a DETR-style decoder, optimized jointly with a Winner-Takes-All (WTA) kinematic Gaussian Mixture Model (GMM), a latent diversity penalty, and a soft-min weighted endpoint loss. By strictly conditioning the decoding phase on diverse, interpretable motion primitives, our approach eliminates the "black box" of standard latent queries while achieving competitive multi-modal accuracy on the Argoverse 2 and Waymo Open Motion datasets. Code is available at: https://github.com/abviv/recall2predict
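
The abstract names a Winner-Takes-All objective and a soft-min weighted endpoint loss without giving their formulas. The sketch below shows one plausible reading, under explicit assumptions: the winner is the mode with the lowest average displacement error, and the soft-min loss weights each mode's endpoint distance by a softmax over negative distances. The function names, the temperature parameter, and the toy trajectories are hypothetical, and the kinematic GMM likelihood and the diversity penalty are omitted.

```python
import math

def endpoint_distances(modes, target):
    """Euclidean distance from each mode's final waypoint to the target's."""
    tx, ty = target[-1]
    return [math.hypot(m[-1][0] - tx, m[-1][1] - ty) for m in modes]

def winner_takes_all(modes, target):
    """Index and average displacement error of the closest mode.

    In a WTA scheme only this mode would receive the regression loss.
    """
    def ade(mode):
        total = sum(
            math.hypot(px - gx, py - gy)
            for (px, py), (gx, gy) in zip(mode, target)
        )
        return total / len(target)

    errors = [ade(m) for m in modes]
    best = min(range(len(modes)), key=errors.__getitem__)
    return best, errors[best]

def soft_min_endpoint_loss(modes, target, temperature=1.0):
    """Assumed form: sum_k w_k * d_k with w = softmax(-d / temperature).

    A small temperature approaches min(d); a large one approaches mean(d).
    """
    d = endpoint_distances(modes, target)
    smallest = min(d)  # shift by the minimum for numerical stability
    weights = [math.exp(-(x - smallest) / temperature) for x in d]
    z = sum(weights)
    return sum(w / z * x for w, x in zip(weights, d))

# Hypothetical ground truth and three predicted modes.
TARGET = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
MODES = [
    [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)],
    [(1.0, 0.5), (2.0, 1.5), (3.0, 3.0)],
    [(1.0, -0.5), (2.0, -1.5), (3.0, -4.0)],
]

if __name__ == "__main__":
    print("winner:", winner_takes_all(MODES, TARGET))
    print("soft-min endpoint loss:", soft_min_endpoint_loss(MODES, TARGET))
```

Under this reading, the soft-min term gives every mode some gradient signal, weighted toward the modes whose endpoints are already close, while the hard WTA term updates only the single best mode.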
