M2H-MX: Multi-Task Dense Visual Perception for Real-Time Monocular Spatial Understanding
arXiv cs.CV / 4/1/2026
Key Points
- The paper introduces M2H-MX, a real-time multi-task dense visual perception model designed to improve monocular spatial understanding from a single camera stream for robotics.
- It combines multi-scale feature preservation with register-gated global context, and restricts cross-task interactions in a lightweight decoder so that depth and semantic prediction remain fast under latency constraints.
- The model’s depth and semantic outputs are integrated directly into an unmodified monocular SLAM pipeline via a compact perception-to-mapping interface, aiming for stable in-the-loop performance.
- Experiments show substantial gains on NYUDv2, including a 6.6% improvement in semantic mIoU and a 9.4% reduction in depth RMSE versus multi-task baselines.
- In real-time monocular mapping on ScanNet, M2H-MX cuts average trajectory error by 60.7% compared with a strong monocular SLAM baseline while producing cleaner metric-semantic maps.
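To make the "register-gated global context" idea in the second bullet concrete, here is a minimal NumPy sketch of one plausible reading: a small set of register tokens summarizes the scene, spatial tokens read from them via cross-attention, a sigmoid gate controls per-channel how much global context each token admits, and two lightweight heads share the fused features. All shapes, weights, and the class count are illustrative placeholders, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical shapes: N spatial tokens (flattened patches), d channels,
# R learned register tokens that summarize global context.
N, d, R = 64, 32, 4
feats = rng.standard_normal((N, d))       # shared encoder features
registers = rng.standard_normal((R, d))   # learned register tokens

# Cross-attention: each spatial token reads from the registers.
attn = softmax(feats @ registers.T / np.sqrt(d))   # (N, R) attention weights
global_ctx = attn @ registers                      # (N, d) pooled context

# Register gating: a sigmoid gate decides, per channel, how much global
# context each token admits (the gate weights here are random placeholders).
gate = 1.0 / (1.0 + np.exp(-(feats @ (rng.standard_normal((d, d)) * 0.1))))
fused = feats + gate * global_ctx

# Two lightweight task heads on the shared fused features.
W_depth = rng.standard_normal((d, 1)) * 0.1
W_seg = rng.standard_normal((d, 13)) * 0.1   # e.g. a NYUDv2-style label set
depth = fused @ W_depth        # (N, 1) per-token depth
seg_logits = fused @ W_seg     # (N, 13) per-token class logits

print(depth.shape, seg_logits.shape)
```

The gate is what keeps cross-task interaction "carefully controlled": each token can suppress global context channel-wise rather than always mixing it in, which matters when the depth and semantic heads would otherwise interfere.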