Pedestrian Crossing Intent Prediction via Psychological Features and Transformer Fusion

arXiv cs.CV / March 23, 2026


Key Points

  • The paper introduces a lightweight, socially informed architecture for pedestrian intention prediction that fuses four behavioral streams (attention, position, situation, and interaction) using highway encoders and a compact 4-token Transformer.
  • It incorporates uncertainty estimation via a variational bottleneck and a Mahalanobis distance detector to provide calibrated probabilities and actionable risk scores.
  • On PSI 1.0, it outperforms recent vision-language models with 0.90 F1, 0.94 AUC-ROC, and 0.78 MCC using only structured features; on PSI 2.0, it establishes a strong baseline of 0.78 F1 and 0.79 AUC-ROC, with Mahalanobis-based selective prediction improving test accuracy by up to 0.4 percentage points at 80% coverage.
  • The approach is modality-agnostic, easy to integrate with vision-language pipelines, and suitable for risk-aware intent prediction on resource-constrained platforms.
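The selective-prediction idea in the last two points can be sketched as a standard Mahalanobis out-of-distribution detector: fit per-class means and a shared covariance on training features, score each test input by its distance to the nearest class centroid, and abstain on the highest-risk fraction. This is a minimal numpy sketch; all function names are hypothetical, and the paper's actual detector operates on the model's learned embeddings rather than raw features.

```python
import numpy as np

def fit_mahalanobis(feats, labels):
    """Fit per-class means and a shared (tied) covariance on training features."""
    classes = np.unique(labels)
    means = {c: feats[labels == c].mean(axis=0) for c in classes}
    centered = np.concatenate([feats[labels == c] - means[c] for c in classes])
    cov = centered.T @ centered / len(feats)
    # regularized inverse so the precision matrix is well-conditioned
    prec = np.linalg.inv(cov + 1e-6 * np.eye(feats.shape[1]))
    return means, prec

def mahalanobis_score(x, means, prec):
    """Squared distance to the nearest class centroid; larger = more out-of-distribution."""
    return min(float((x - m) @ prec @ (x - m)) for m in means.values())

def selective_predict(test_feats, means, prec, coverage=0.8):
    """Keep the `coverage` fraction of inputs with the lowest risk; abstain on the rest."""
    scores = np.array([mahalanobis_score(x, means, prec) for x in test_feats])
    threshold = np.quantile(scores, coverage)
    return scores <= threshold  # boolean mask: True = predict, False = abstain
```

At 80% coverage the mask keeps the 80% of test inputs closest to the training distribution, which is where the reported accuracy gain comes from: the abstained 20% are exactly the distribution-shifted cases the model is most likely to get wrong.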

Abstract

Pedestrian intention prediction must be accurate for autonomous vehicles to navigate urban environments safely. We present a lightweight, socially informed architecture for pedestrian intention prediction that fuses four behavioral streams (attention, position, situation, and interaction) using highway encoders, a compact 4-token Transformer, and global self-attention pooling. To quantify uncertainty, we incorporate two complementary heads: a variational bottleneck whose KL divergence captures epistemic uncertainty, and a Mahalanobis distance detector that identifies distributional shift. Together, these components yield calibrated probabilities and actionable risk scores without compromising efficiency. On the PSI 1.0 benchmark, our model outperforms recent vision-language models, achieving 0.90 F1, 0.94 AUC-ROC, and 0.78 MCC using only structured, interpretable features. On the more diverse PSI 2.0 dataset, where, to the best of our knowledge, no prior results exist, we establish a strong initial baseline of 0.78 F1 and 0.79 AUC-ROC. Selective prediction based on Mahalanobis scores increases test accuracy by up to 0.4 percentage points at 80% coverage. Qualitative attention heatmaps further show how the model shifts its cross-stream focus under ambiguity. The proposed approach is modality-agnostic, easy to integrate with vision-language pipelines, and suitable for risk-aware intent prediction on resource-constrained platforms.
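The fusion path described in the abstract (one highway-encoded token per behavioral stream, self-attention across the four tokens, then attention pooling) can be sketched in plain numpy. All dimensions, weight shapes, and the random initialization below are hypothetical; a trained model would learn these weights, and the paper's implementation details may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # hypothetical shared token width

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def highway(x, W_h, W_t):
    """Highway encoder layer: a sigmoid gate blends a tanh transform with the input."""
    gate = 1.0 / (1.0 + np.exp(-(x @ W_t)))
    return gate * np.tanh(x @ W_h) + (1.0 - gate) * x

def fuse(streams):
    """Project each stream to one D-dim token, run single-head self-attention
    over the 4 stream tokens, then pool with a global attention vector.
    Weights are drawn randomly here purely for illustration."""
    tokens = np.stack([highway(s @ rng.normal(scale=0.1, size=(s.shape[0], D)),
                               rng.normal(scale=0.1, size=(D, D)),
                               rng.normal(scale=0.1, size=(D, D)))
                       for s in streams])                       # (4, D)
    Wq, Wk, Wv = (rng.normal(scale=0.1, size=(D, D)) for _ in range(3))
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(D))                        # (4, 4) cross-stream weights
    mixed = attn @ v                                            # (4, D)
    pool = softmax(mixed @ rng.normal(scale=0.1, size=D))       # global attention pooling
    return pool @ mixed                                         # (D,) fused representation
```

Because the sequence length is fixed at four tokens, the attention matrix is only 4x4, which is what keeps the Transformer stage cheap enough for resource-constrained platforms; the (4, 4) `attn` weights are also what the paper's qualitative heatmaps visualize when showing cross-stream focus shifting under ambiguity.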