M2H-MX: Multi-Task Dense Visual Perception for Real-Time Monocular Spatial Understanding

arXiv cs.CV / 4/1/2026


Key Points

  • The paper introduces M2H-MX, a real-time multi-task dense visual perception model for robotic spatial understanding from a single camera stream.
  • It combines multi-scale feature preservation with register-gated global context and carefully controlled cross-task interactions in a lightweight decoder, supporting fast depth and semantic prediction under latency constraints (a hypothetical sketch of these two mechanisms follows this list).
  • The model’s depth and semantic outputs are integrated directly into an unmodified monocular SLAM pipeline via a compact perception-to-mapping interface, aiming for stable in-the-loop performance.
  • Experiments on NYUDv2 show substantial gains, with the larger M2H-MX-L variant improving semantic mIoU by 6.6% and reducing depth RMSE by 9.4% over representative multi-task baselines.
  • In real-time monocular mapping on ScanNet, M2H-MX cuts average trajectory error by 60.7% compared with a strong monocular SLAM baseline while producing cleaner metric-semantic maps.
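The paper's implementation is not shown here, so the second key point can only be illustrated with a minimal PyTorch-style sketch. Everything below is an assumption for illustration: the module names, shapes, and gating forms are not taken from M2H-MX; they only show what register-gated global context and a gated cross-task exchange inside a lightweight decoder could look like.

```python
# Hypothetical sketch (not the authors' code).
# RegisterGatedContext: learned register tokens attend over all spatial tokens
# to form a global summary, and a sigmoid gate controls how much of that
# summary is injected back at each location.
# GatedCrossTaskExchange: each task branch receives only a gated, bounded
# fraction of the other branch's features ("controlled" interaction).
import torch
import torch.nn as nn

class RegisterGatedContext(nn.Module):
    def __init__(self, dim: int, num_registers: int = 4, heads: int = 4):
        super().__init__()
        self.registers = nn.Parameter(torch.randn(1, num_registers, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) dense features from one decoder scale
        b, c, h, w = feats.shape
        tokens = feats.flatten(2).transpose(1, 2)          # (B, H*W, C)
        regs = self.registers.expand(b, -1, -1)            # (B, R, C)
        # Registers summarize global context from all spatial tokens.
        summary, _ = self.attn(regs, tokens, tokens)       # (B, R, C)
        ctx = summary.mean(dim=1, keepdim=True)            # (B, 1, C)
        # Gate decides, per location, how much global context to inject.
        gated = tokens + self.gate(tokens) * ctx
        return gated.transpose(1, 2).reshape(b, c, h, w)

class GatedCrossTaskExchange(nn.Module):
    """Controlled depth<->semantics interaction via learned sigmoid gates."""
    def __init__(self, dim: int):
        super().__init__()
        self.to_depth = nn.Sequential(nn.Conv2d(dim, dim, 1), nn.Sigmoid())
        self.to_sem = nn.Sequential(nn.Conv2d(dim, dim, 1), nn.Sigmoid())

    def forward(self, depth_f: torch.Tensor, sem_f: torch.Tensor):
        depth_out = depth_f + self.to_depth(sem_f) * sem_f
        sem_out = sem_f + self.to_sem(depth_f) * depth_f
        return depth_out, sem_out
```

The sigmoid gates are one simple way to keep the interaction "controlled": each branch can only absorb a bounded, learned fraction of global context or of the other task's features, which is the kind of mechanism the key point describes, not necessarily the one the paper uses.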

Abstract

Monocular cameras are attractive for robotic perception due to their low cost and ease of deployment, yet achieving reliable real-time spatial understanding from a single image stream remains challenging. While recent multi-task dense prediction models have improved per-pixel depth and semantic estimation, translating these advances into stable monocular mapping systems is still non-trivial. This paper presents M2H-MX, a real-time multi-task perception model for monocular spatial understanding. The model preserves multi-scale feature representations while introducing register-gated global context and controlled cross-task interaction in a lightweight decoder, enabling depth and semantic predictions to reinforce each other under strict latency constraints. Its outputs integrate directly into an unmodified monocular SLAM pipeline through a compact perception-to-mapping interface. We evaluate both dense prediction accuracy and in-the-loop system performance. On NYUDv2, M2H-MX-L achieves state-of-the-art results, improving semantic mIoU by 6.6% and reducing depth RMSE by 9.4% over representative multi-task baselines. When deployed in a real-time monocular mapping system on ScanNet, M2H-MX reduces average trajectory error by 60.7% compared to a strong monocular SLAM baseline while producing cleaner metric-semantic maps. These results demonstrate that modern multi-task dense prediction can be reliably deployed for real-time monocular spatial perception in robotic systems.
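As a rough illustration of what a "compact perception-to-mapping interface" could mean in practice, here is a hypothetical sketch of a per-frame packet handed from the perception model to an unmodified SLAM pipeline. The field names, the confidence channel, and the masking rule are assumptions for illustration, not the paper's actual interface.

```python
# Hypothetical sketch: a compact per-frame packet carrying the network's
# depth and semantic predictions into an unmodified monocular SLAM pipeline.
# Field names, resolutions, and the downstream API are assumptions.
from dataclasses import dataclass
import numpy as np

@dataclass
class PerceptionPacket:
    timestamp: float
    depth: np.ndarray        # (H, W) metric depth in meters
    semantics: np.ndarray    # (H, W) per-pixel class ids
    confidence: np.ndarray   # (H, W) per-pixel confidence in [0, 1]

def to_slam_inputs(packet: PerceptionPacket, min_conf: float = 0.5):
    """Mask out low-confidence pixels so the SLAM back end only consumes
    depth it can trust; semantics are later attached to the map points."""
    mask = packet.confidence >= min_conf
    depth = np.where(mask, packet.depth, 0.0)  # 0 = invalid, RGB-D convention
    return depth, packet.semantics, mask
```

In this reading, a mapping back end that already accepts RGB-D-style input can consume the masked depth without modification, which matches the abstract's claim that the SLAM pipeline itself is left unchanged.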