A Semantic Autonomy Framework for VLM-Integrated Indoor Mobile Robots: Hybrid Deterministic Reasoning and Cross-Robot Adaptive Memory

arXiv cs.RO / 5/5/2026


Key Points

  • The paper addresses indoor mobile robots’ inability to interpret natural-language instructions that express intent rather than metric positions, proposing a framework that integrates vision-language reasoning with existing navigation stacks such as ROS 2 Navigation 2.
  • It introduces a “Semantic Autonomy Stack” combining deterministic and VLM reasoning: a seven-step parametric resolver handles 88% of instructions in under 0.1 milliseconds without calling a language model, camera, or GPU (a minimal sketch of this routing follows the list).
  • Only genuinely ambiguous instructions escalate to slower VLM-based reasoning (2–9 seconds on consumer hardware), improving practical deployment latency while retaining semantic understanding.
  • To overcome session-by-session amnesia, it adds a cross-robot adaptive semantic memory system with explicit scope categories (global environment, operator preferences, robot capabilities), transferring learned preferences between robots.
  • Experiments on two differential-drive robots built on Raspberry Pi 5 (no onboard GPU) report 100% semantic transfer and resolution accuracy across three sessions, along with concurrent multi-robot feasibility and a measured 103,000-fold latency reduction via deterministic resolution and shared compiled digests.
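
The deterministic-first routing described above can be pictured as a thin dispatch layer in front of the navigation stack. Below is a minimal Python sketch under stated assumptions: `parametric_resolve`, `KNOWN_GOALS`, and the `vlm_resolve` callback are invented names, and the paper's seven-step resolver is collapsed here into two steps (normalization plus a digest lookup) for brevity.

```python
import re
import time
from typing import Callable, Optional

Goal = tuple[float, float, float]  # metric goal (x, y, theta) for Nav2

# Hypothetical compiled digest: known phrases mapped to metric goals.
KNOWN_GOALS: dict[str, Goal] = {
    "charging dock": (0.0, 0.0, 0.0),
    "kitchen door": (4.2, 1.5, 1.57),
}

def parametric_resolve(instruction: str) -> Optional[Goal]:
    """Fast deterministic path: string normalization plus table lookup.

    Stands in for the paper's seven-step resolver; no language model,
    camera, or GPU is touched.
    """
    text = re.sub(r"[^a-z ]", "", instruction.lower())
    for phrase, goal in KNOWN_GOALS.items():
        if phrase in text:
            return goal
    return None  # genuinely ambiguous: hand off to the VLM

def resolve(instruction: str, vlm_resolve: Callable[[str], Goal]) -> Goal:
    """Deterministic first (sub-millisecond); VLM fallback (seconds)."""
    start = time.perf_counter()
    goal = parametric_resolve(instruction)
    if goal is not None:
        elapsed_ms = (time.perf_counter() - start) * 1e3
        print(f"deterministic hit in {elapsed_ms:.3f} ms")
        return goal
    return vlm_resolve(instruction)  # slow semantic reasoning path
```

Because the common case never leaves the table lookup, the expensive VLM call is paid only when an instruction genuinely needs semantic disambiguation.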

Abstract

Autonomous indoor mobile robots can navigate reliably to metric coordinates using established frameworks such as ROS 2 Navigation 2, yet they lack the ability to interpret natural language instructions that express intent rather than positions. Vision-Language Models offer the semantic reasoning required to bridge this gap, but their inference latency (2-9 seconds per decision on consumer hardware) and session-by-session amnesia limit practical deployment. This paper presents the Semantic Autonomy Stack, a six-layer reference framework for semantically autonomous indoor navigation, and validates a complete instance featuring hybrid deterministic-VLM reasoning and cross-robot adaptive memory on physical robots with off-the-shelf edge hardware. A seven-step parametric resolver handles 88% of instructions in under 0.1 milliseconds without invoking a language model, camera, or GPU; only genuinely ambiguous instructions escalate to VLM reasoning. A five-category semantic memory framework with explicit scope taxonomy (global environment knowledge, per-operator preferences, per-robot capabilities) enables cross-session learning and cross-robot knowledge transfer: preferences learned through VLM interactions on one robot are promoted to deterministic resolution and transferred to a second robot via a shared compiled digest, achieving a measured latency reduction of 103,000-fold. Experimental validation on two custom-built differential-drive robots across 82 scenario-level decisions and three sessions demonstrates 100% semantic transfer accuracy (33/33, 95% CI [0.894, 1.000]), 100% semantic resolution accuracy, and concurrent multi-robot operation feasibility, all on Raspberry Pi 5 platforms with no onboard GPU, requiring zero training data.
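
To make the scope taxonomy and the compiled-digest transfer concrete, here is a hedged sketch assuming a three-scope memory like the one the abstract describes; the `SemanticMemory` class, its methods, and the JSON digest format are invented for illustration, and per-robot capability entries are deliberately excluded from the shared digest.

```python
import json
from dataclasses import dataclass, field
from enum import Enum

class Scope(Enum):
    GLOBAL_ENV = "global_environment"  # shared facts about the building
    OPERATOR = "operator_preference"   # per-operator, transferable
    ROBOT = "robot_capability"         # tied to one physical robot

@dataclass
class SemanticMemory:
    # key -> (scope, value); values first learned via slow VLM interactions
    entries: dict = field(default_factory=dict)

    def learn(self, key: str, value, scope: Scope) -> None:
        self.entries[key] = (scope, value)

    def compile_digest(self) -> str:
        """Promote transferable entries into a compiled digest;
        robot-capability entries deliberately stay local."""
        shareable = {k: (s.value, v) for k, (s, v) in self.entries.items()
                     if s is not Scope.ROBOT}
        return json.dumps(shareable)

    def load_digest(self, digest: str) -> None:
        """A second robot ingests the digest, so these entries resolve
        deterministically instead of triggering its own VLM calls."""
        for key, (scope, value) in json.loads(digest).items():
            self.entries[key] = (Scope(scope), value)

# Preferences learned on robot A transfer to robot B via the digest.
robot_a = SemanticMemory()
robot_a.learn("the usual spot", (4.2, 1.5, 1.57), Scope.OPERATOR)
robot_b = SemanticMemory()
robot_b.load_digest(robot_a.compile_digest())
print(robot_b.entries["the usual spot"])  # JSON round-trip yields a list goal
```

The design choice this illustrates is the promotion path: a preference is learned once through slow VLM interaction, written into a transferable scope, and from then on both robots resolve it through the fast deterministic table, which is the mechanism behind the reported latency reduction.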