Adaptive Geodesic Conformal Prediction for Egocentric Camera Pose Estimation

arXiv cs.CV / 5/4/2026

Key Points

The paper evaluates conformal prediction (CP) for egocentric camera pose estimation in AR/assistive settings and finds that standard fixed-threshold CP undercovers the hardest 25% of frames (Q4): about 60% coverage against a nominal 90%.
It shows that a geodesic SE(3) nonconformity score identifies physically difficult frames better than a Euclidean score: the two scores share only 15-26% of their Q4 frames, and the geodesic-selected Q4 frames show 2-3x higher ground-truth camera displacement.
To address the conditional coverage gap, the authors propose DINOv2-Bridge adaptive conformal prediction with a two-stage difficulty estimator that transfers across participants without using any images at test time.
Experiments on EPIC-Fields report that Q4 coverage improves from roughly 0.75 to about 0.93 while keeping overall coverage near the 90% target, across multiple predictors and horizons.
Overall, the work demonstrates that adaptive difficulty estimation plus an appropriate geometry-aware nonconformity score can restore strong uncertainty guarantees specifically on difficult egocentric frames.
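
The conditional coverage gap described in the points above can be reproduced on synthetic data. The following is a minimal sketch, not the paper's pipeline: the data generator, the `difficulty_hat` estimate (a noisy stand-in for a learned difficulty estimator such as DINOv2-Bridge), and the definition of Q4 as the top quartile of the simulated difficulty are all assumptions made for illustration. It compares split conformal prediction with one fixed threshold against a normalized variant whose threshold scales with estimated difficulty.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
ALPHA = 0.10  # target miscoverage, i.e. nominal 90% coverage


def simulate(n):
    """Synthetic frames whose error magnitude grows with a hidden difficulty."""
    difficulty = rng.uniform(0.5, 3.0, size=n)
    score = difficulty * np.abs(rng.normal(size=n))  # nonconformity score
    # Hypothetical noisy difficulty estimate (stand-in for a learned estimator).
    difficulty_hat = difficulty * np.exp(0.1 * rng.normal(size=n))
    return difficulty, score, difficulty_hat


def conformal_quantile(cal_scores, alpha):
    """Split-CP threshold: the ceil((n + 1)(1 - alpha))-th smallest score."""
    n = len(cal_scores)
    k = min(math.ceil((n + 1) * (1 - alpha)), n)
    return np.sort(cal_scores)[k - 1]


d_cal, s_cal, dhat_cal = simulate(4000)
d_test, s_test, dhat_test = simulate(4000)

# Fixed-threshold CP: one radius for every frame.
q_fixed = conformal_quantile(s_cal, ALPHA)
covered_fixed = s_test <= q_fixed

# Adaptive (normalized) CP: calibrate on score / difficulty_hat, then
# scale the threshold per frame by that frame's estimated difficulty.
q_norm = conformal_quantile(s_cal / dhat_cal, ALPHA)
covered_adaptive = s_test <= q_norm * dhat_test

# Q4 here = top 25% of frames by simulated difficulty (an assumption).
q4 = d_test >= np.quantile(d_test, 0.75)

overall_fixed = covered_fixed.mean()
overall_adaptive = covered_adaptive.mean()
q4_fixed = covered_fixed[q4].mean()
q4_adaptive = covered_adaptive[q4].mean()

print(f"fixed:    overall={overall_fixed:.3f}  Q4={q4_fixed:.3f}")
print(f"adaptive: overall={overall_adaptive:.3f}  Q4={q4_adaptive:.3f}")
```

In this toy setting both methods land near the 90% target overall, but only the normalized variant stays near it on the hardest quartile. The normalization works because the per-frame threshold grows with estimated difficulty, so hard frames get wider regions and easy frames get tighter ones.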

Abstract

Egocentric pose estimation for Augmented Reality (AR) and assistive devices requires not just accurate predictions but guaranteed uncertainty regions. Conformal prediction (CP) provides such guarantees without retraining, but we show that standard CP with a single fixed threshold achieves nominal 90% overall coverage while covering only ~60% of the hardest 25% of frames (Q4) -- a ~30 percentage-point conditional coverage gap consistent across 12 participants, 3 predictors, and 3 horizons (108 evaluations) on EPIC-Fields. We further show that a geodesic SE(3) nonconformity score identifies physically harder frames than Euclidean scoring, with only 15-26% Q4 overlap and 2-3x higher ground-truth camera displacement for geodesic Q4 frames. To close the coverage gap, we propose DINOv2-Bridge adaptive CP: a two-stage difficulty estimator trained on a single source participant that transfers cross-participant without any images at test time, improving Q4 coverage from ~0.75 to ~0.93 while maintaining overall coverage at the 90% target.
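
To make the contrast between geodesic and Euclidean scoring concrete, here is a small sketch of one possible pair of scores on SE(3) poses stored as 4x4 homogeneous matrices. The abstract does not give the paper's exact formulas, so everything below is an assumption: the geodesic score combines the rotation angle of the relative rotation with the translation distance (a product metric on SO(3) x R^3, with a hypothetical `rot_weight` to balance radians against meters), and the Euclidean baseline is taken to be translation error only. The helper `make_pose` is also illustrative.

```python
import numpy as np


def rotation_angle(R_a, R_b):
    """Geodesic distance on SO(3): angle (radians) of the rotation R_a^T R_b."""
    cos_theta = (np.trace(R_a.T @ R_b) - 1.0) / 2.0
    # Clip guards against values slightly outside [-1, 1] from round-off.
    return float(np.arccos(np.clip(cos_theta, -1.0, 1.0)))


def geodesic_score(T_pred, T_true, rot_weight=1.0):
    """Assumed geodesic nonconformity score: rotation angle and translation
    distance combined in quadrature. rot_weight is a hypothetical scale."""
    theta = rotation_angle(T_pred[:3, :3], T_true[:3, :3])
    trans = np.linalg.norm(T_pred[:3, 3] - T_true[:3, 3])
    return float(np.sqrt((rot_weight * theta) ** 2 + trans ** 2))


def euclidean_score(T_pred, T_true):
    """Assumed Euclidean baseline: translation error only, ignoring rotation."""
    return float(np.linalg.norm(T_pred[:3, 3] - T_true[:3, 3]))


def make_pose(yaw, t):
    """Illustrative helper: pose with a rotation about z and translation t."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = t
    return T


T_true = make_pose(0.0, [0.0, 0.0, 0.0])
T_pred = make_pose(np.pi / 2, [3.0, 4.0, 0.0])  # 90 degree yaw error, 5 m offset

geo = geodesic_score(T_pred, T_true)
euc = euclidean_score(T_pred, T_true)
print(f"geodesic={geo:.4f}  euclidean={euc:.4f}")
```

Under these assumptions, the geodesic score rises with rotation error that the translation-only score cannot see. That is one plausible reason the two scores could rank different frames as hardest, though the paper's own analysis should be consulted for the actual mechanism.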