Towards Intrinsic-Aware Monocular 3D Object Detection
arXiv cs.CV / 3/31/2026
Key Points
- The paper addresses monocular 3D object detection’s sensitivity to camera intrinsics and its limited generalization across different camera setups.
- It proposes MonoIA, a unified intrinsic-aware framework that treats intrinsic changes as perceptual transformations affecting apparent scale, perspective, and geometry.
- MonoIA uses large language models and vision-language models to produce intrinsic embeddings, then integrates them hierarchically into the detection network via an Intrinsic Adaptation Module to adapt features per camera.
- The approach reframes intrinsic modeling from numeric conditioning to semantic representation to achieve more consistent 3D detection across cameras.
- Experiments report new state-of-the-art results on KITTI, Waymo, and nuScenes, including +1.18% on the KITTI leaderboard and +4.46% on KITTI Val under multi-dataset training.
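
To see why intrinsics matter for the sensitivity described in the first point, the standard pinhole camera model is enough: an object's height in pixels is proportional to the focal length, so the same car at the same distance looks twice as tall to a camera with twice the focal length. The sketch below is only an illustration of that relationship, not code from MonoIA. The function names, the focal lengths, and the 1.5 m car height are all assumed values chosen for the example.

```python
def apparent_height_px(focal_px: float, object_height_m: float, depth_m: float) -> float:
    """Pinhole projection: pixel height of an object with a given metric height and depth."""
    return focal_px * object_height_m / depth_m


def depth_from_height(focal_px: float, object_height_m: float, height_px: float) -> float:
    """Invert the pinhole relation to recover depth from an observed pixel height."""
    return focal_px * object_height_m / height_px


# Illustrative values (assumptions, not taken from the paper or any dataset).
CAR_HEIGHT_M = 1.5
TRUE_DEPTH_M = 30.0
FOCAL_A_PX = 700.0   # hypothetical camera A
FOCAL_B_PX = 1400.0  # hypothetical camera B, twice the focal length

# The same car at the same depth, seen by the two cameras.
height_a = apparent_height_px(FOCAL_A_PX, CAR_HEIGHT_M, TRUE_DEPTH_M)
height_b = apparent_height_px(FOCAL_B_PX, CAR_HEIGHT_M, TRUE_DEPTH_M)

# A model that implicitly assumes camera A's focal length misreads camera B's image.
depth_assuming_a = depth_from_height(FOCAL_A_PX, CAR_HEIGHT_M, height_b)
# Using the correct intrinsics recovers the true depth.
depth_with_b = depth_from_height(FOCAL_B_PX, CAR_HEIGHT_M, height_b)

print(f"pixel height, camera A: {height_a:.1f}")
print(f"pixel height, camera B: {height_b:.1f}")
print(f"depth if camera A's focal length is assumed: {depth_assuming_a:.1f} m")
print(f"depth with camera B's focal length: {depth_with_b:.1f} m")
```

In this toy setup, ignoring the change in focal length halves the estimated depth. That is the kind of cross-camera inconsistency an intrinsic-aware detector aims to avoid.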


