Boxer: Robust Lifting of Open-World 2D Bounding Boxes to 3D
arXiv cs.CV / 4/8/2026
Key Points
- The paper introduces Boxer, a transformer-based algorithm that lifts 2D open-vocabulary detections into static, metric 3D bounding boxes using posed images and optional depth (sparse point cloud or dense depth).
- BoxerNet forms the core lifting module, taking 2D bounding box proposals and producing 3D boxes that are then refined via multi-view fusion and geometric filtering to yield globally consistent, de-duplicated 3D results.
- The approach delegates detection to existing 2D open-vocabulary detectors (e.g., DETIC, OWLv2, SAM3), so the main model only has to learn 3D lifting; the aim is to reduce reliance on costly 3D bounding-box annotation.
- The method extends a CuTR-style formulation with aleatoric uncertainty for more robust regression, and it supports sparse-depth inputs via median depth patch encoding. Training uses over 1.2M unique 3D bounding boxes.
- Reported results show substantial gains over prior baselines, including large improvements in egocentric settings without dense depth and strong performance on CA-1M when dense depth is available.
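
Two of the ideas above, median depth patch encoding and aleatoric uncertainty, can be illustrated with a minimal sketch. This is not the paper's implementation: the summary does not give the exact formulation, so the function names (`median_depth_patches`, `laplace_nll`), the patch size, and the choice of a Laplace negative log-likelihood are all assumptions made for illustration.

```python
import math
from statistics import median


def median_depth_patches(points, width, height, patch):
    """Reduce sparse depth samples to one median depth per image patch.

    points: iterable of (u, v, depth) with pixel coordinates and metric depth.
    Returns a rows x cols grid; a cell is None if no valid sample fell in it.
    (Hypothetical helper; the patch layout is an assumption.)
    """
    cols = math.ceil(width / patch)
    rows = math.ceil(height / patch)
    buckets = [[[] for _ in range(cols)] for _ in range(rows)]
    for u, v, d in points:
        # Keep only samples inside the image with positive depth.
        if 0 <= u < width and 0 <= v < height and d > 0:
            buckets[int(v) // patch][int(u) // patch].append(d)
    return [[median(b) if b else None for b in row] for row in buckets]


def laplace_nll(pred, target, log_b):
    """Laplace negative log-likelihood with a predicted log-scale log_b.

    Computes |pred - target| / b + log(b), dropping the constant log(2).
    A larger predicted scale b down-weights the residual but pays a log(b)
    penalty, which is one common way to model aleatoric uncertainty.
    (Assumed form; the paper may use a different likelihood.)
    """
    return abs(pred - target) * math.exp(-log_b) + log_b


# Usage: four sparse depth samples on a 16x16 image with 8x8 patches.
samples = [(1, 1, 2.0), (2, 3, 4.0), (3, 2, 3.0), (10, 10, 5.0)]
grid = median_depth_patches(samples, width=16, height=16, patch=8)

# A confident prediction (log_b = 0) versus an uncertain one (log_b = 2)
# for the same 10-unit regression error.
confident = laplace_nll(10.0, 0.0, 0.0)
uncertain = laplace_nll(10.0, 0.0, 2.0)
```

The median makes each patch value insensitive to a few outlier depth samples, and the learned scale lets the regressor soften the loss on targets it cannot predict reliably instead of being dominated by them.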