RADIO-ViPE: Online Tightly Coupled Multi-Modal Fusion for Open-Vocabulary Semantic SLAM in Dynamic Environments
arXiv cs.CV / 4/30/2026
Key Points
- RADIO-ViPE is a new online semantic SLAM system that performs geometry-aware open-vocabulary grounding by linking natural-language queries to localized 3D regions and objects in dynamic environments.
- Unlike prior methods that depend on calibrated, posed RGB-D inputs, it works directly from raw monocular RGB video without requiring camera intrinsics, depth sensors, or pose initialization.
- The approach tightly couples multi-modal vision-language embeddings from agglomerative foundation models (e.g., RADIO) with geometric scene information during initialization, optimization, and factor-graph construction to improve cross-modal map consistency.
- It uses adaptive robust kernels to handle both actively moving objects and agent-displaced scene changes (such as furniture rearranged during egocentric sessions).
- Experiments show state-of-the-art performance on the dynamic TUM-RGBD benchmark and competitive results versus offline open-vocabulary methods that assume calibrated sensors and mostly static scenes.
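The open-vocabulary grounding described above amounts to matching a text-query embedding against per-region vision-language embeddings stored in the map. A minimal sketch of that scoring step, assuming cosine similarity and illustrative region names (the paper's actual embedding model and map structure are not shown here):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def ground_query(query_emb, region_embs):
    """Rank mapped 3D regions by similarity to a language-query embedding."""
    scored = [(rid, cosine(query_emb, emb)) for rid, emb in region_embs.items()]
    return sorted(scored, key=lambda t: t[1], reverse=True)

# Toy 3-D embeddings; real ones would come from a model such as RADIO.
regions = {
    "chair_3": [0.9, 0.1, 0.0],
    "table_1": [0.1, 0.9, 0.1],
}
print(ground_query([0.8, 0.2, 0.0], regions)[0][0])  # best match: chair_3
```

In a geometry-aware system the top-ranked region would also carry its localized 3D extent, so the query resolves to a place in the map rather than just a label.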
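The adaptive robust kernels mentioned above down-weight residuals from moving or rearranged objects during optimization. The paper's exact kernel is not reproduced here; as a stand-in, this sketch uses an IRLS-style Huber weight whose threshold adapts to the residual distribution via the median absolute deviation:

```python
import statistics

def huber_weight(r, delta):
    """IRLS weight for the Huber kernel: 1 inside delta, delta/|r| outside."""
    return 1.0 if abs(r) <= delta else delta / abs(r)

def adaptive_weights(residuals, k=1.345):
    """Scale the Huber threshold from the MAD so dynamic-object outliers
    receive low weight while static-scene residuals keep full weight."""
    med = statistics.median(residuals)
    mad = statistics.median(abs(r - med) for r in residuals)
    delta = k * 1.4826 * max(mad, 1e-9)  # 1.4826: MAD-to-sigma for Gaussians
    return [huber_weight(r, delta) for r in residuals]

# Three small static-point residuals and one large one from a moved object:
print(adaptive_weights([0.1, -0.1, 0.05, 5.0]))
```

Recomputing the scale each iteration is what makes the kernel adaptive: as the optimizer converges, the threshold tightens and residuals from displaced geometry are suppressed further.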