Beyond Specialization: Robust Reinforcement Learning Navigation via Procedural Map Generators

arXiv cs.RO / 5/5/2026


Key Points

  • The study addresses a key limitation of deep reinforcement learning (DRL) navigation—overfitting to the limited structure of manually designed training environments—by using procedurally generated maps with guaranteed navigability.
  • The authors built MuRoSim, a 2D simulator focused on training efficiency for LiDAR-based navigation, integrating four procedural map generator types (sparse, maze, graph, and Wave Function Collapse), and cross-evaluated five navigation policies on 1,000 seeded maps per generator across three training seeds (a minimal generator sketch follows this list).
  • Cross-generator transfer proved highly asymmetric: a policy specialized to sparse layouts dropped to 3.3% success on maze maps, while training on the combined generator set reached 91.5 ± 1.1% mean success (see the evaluation sketch after this list).
  • Robustness was driven mainly by A* path-planner subgoal inputs, which raised success from the 90.2 ± 1.4% feedforward baseline to 98.9 ± 0.4%; GRU recurrence improved only the reactive baseline (the subgoal mechanism is sketched after the abstract).
  • The learned policies outperformed a classical Carrot+A* controller, which matched their success only at low speed (1.0 m/s) and collapsed to 24.9% at 2.0 m/s, making learned speed adaptation the decisive advantage; real-world tests on a RoboMaster confirmed sim-to-real transfer in a cluttered arena, while a maze-like layout exposed residual failure modes that recurrence helps mitigate.
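
The "guaranteed navigability" the paper relies on can be obtained by construction. Below is a minimal, hedged sketch of a maze-style generator in the recursive-backtracker family: the carved passages form a spanning tree over the cell grid, so every free cell is reachable from every other, and any sampled start/goal pair is solvable. Names and grid conventions are illustrative assumptions, not MuRoSim's actual generator.

```python
import random

def generate_maze(cells_w: int, cells_h: int, seed: int) -> list[list[int]]:
    """Return a (2*cells_h+1) x (2*cells_w+1) occupancy grid: 1 = wall, 0 = free."""
    rng = random.Random(seed)          # seeded, so maps are reproducible
    grid = [[1] * (2 * cells_w + 1) for _ in range(2 * cells_h + 1)]
    grid[1][1] = 0                     # open the start cell
    visited = {(0, 0)}
    stack = [(0, 0)]
    while stack:
        cx, cy = stack[-1]
        # Unvisited 4-neighbours of the current cell.
        options = [(cx + dx, cy + dy)
                   for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                   if 0 <= cx + dx < cells_w and 0 <= cy + dy < cells_h
                   and (cx + dx, cy + dy) not in visited]
        if not options:
            stack.pop()                # dead end: backtrack
            continue
        nx, ny = rng.choice(options)
        grid[cy + ny + 1][cx + nx + 1] = 0   # knock out the shared wall
        grid[2 * ny + 1][2 * nx + 1] = 0     # open the newly reached cell
        visited.add((nx, ny))
        stack.append((nx, ny))
    return grid
```

Because the depth-first carve visits every cell exactly once, the free space is connected by construction; no post-hoc reachability check is needed before placing start and goal.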

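The cross-generator protocol itself is simple to express: roll every policy out on the same seeded maps from every generator and aggregate success per (policy, generator) pair. In this sketch, `make_map`, `run_episode`, and the `policies` dict are hypothetical placeholders rather than MuRoSim's API; only the 1,000-maps-per-generator count comes from the paper.

```python
from statistics import mean

GENERATORS = ("sparse", "maze", "graph", "wfc")   # wfc = Wave Function Collapse

def cross_evaluate(policies, make_map, run_episode, n_maps=1000):
    """Mean success of every policy on every generator's seeded map set."""
    results = {}
    for name, policy in policies.items():
        for gen in GENERATORS:
            # Seed = map index, so every policy is scored on identical maps.
            successes = [run_episode(policy, make_map(gen, seed=i))
                         for i in range(n_maps)]
            results[(name, gen)] = mean(1.0 if s else 0.0 for s in successes)
    return results
```

Fixing the seed per map index is what makes the transfer matrix comparable across policies: a specialist and a generalist are scored on identical layouts, so the reported asymmetry reflects the policies, not the map draw.
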
Abstract

Deep reinforcement learning (DRL) navigation policies often overfit to the structure of their training environments, as environmental diversity is typically constrained by the manual effort required to design diverse scenarios. While procedural map generation offers scalable diversity, no prior work systematically compares how different generator types affect policy generalization. We integrate four generators (sparse, maze, graph, and Wave Function Collapse) with guaranteed navigability into MuRoSim, a 2D simulator focusing on training efficiency for LiDAR-based navigation. We cross-evaluate five navigation policies on 1000 seeded maps per generator across three training seeds. Results show a strongly asymmetric cross-generator transfer: a specialist trained on sparse layouts falls to 3.3% success on mazes, whereas a policy trained on the combined generator set achieves 91.5 ± 1.1% mean success. We further demonstrate that A* path-planner subgoal inputs are the dominant factor for robustness, raising success from the 90.2 ± 1.4% feedforward baseline to 98.9 ± 0.4% and outperforming GRU recurrence, which only improves the reactive baseline. The DRL policies outperform a classical Carrot+A* controller, which matches their success only at low speeds (1.0 m/s) but collapses to 24.9% at 2.0 m/s. This highlights learned speed adaptation as the decisive advantage of the learned approach. Real-world experiments on a RoboMaster confirm sim-to-real transfer in a cluttered arena, while a maze-like layout exposes remaining failure modes that recurrence helps mitigate.
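
To make the "A* path-planner subgoal inputs" concrete: the idea is to plan a coarse grid path with A* and let the policy observe a near-term waypoint on that path rather than only the distant goal, offloading global reasoning to the planner. The sketch below assumes an occupancy-grid map and a fixed 5-cell lookahead; both are illustrative choices, since the paper states only that planner subgoals are part of the observation.

```python
import heapq

def astar(grid, start, goal):
    """4-connected A* on an occupancy grid (0 = free, 1 = wall); returns a cell path."""
    def h(c):  # Manhattan distance, an admissible heuristic on a 4-connected grid
        return abs(c[0] - goal[0]) + abs(c[1] - goal[1])
    frontier = [(h(start), 0, start)]
    came_from, best_g = {start: None}, {start: 0}
    while frontier:
        _, g, cur = heapq.heappop(frontier)
        if cur == goal:                      # reconstruct path start -> goal
            path = []
            while cur is not None:
                path.append(cur)
                cur = came_from[cur]
            return path[::-1]
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cur[0] + dx, cur[1] + dy)
            if (0 <= nxt[0] < len(grid) and 0 <= nxt[1] < len(grid[0])
                    and grid[nxt[0]][nxt[1]] == 0
                    and g + 1 < best_g.get(nxt, float("inf"))):
                best_g[nxt] = g + 1
                came_from[nxt] = cur
                heapq.heappush(frontier, (g + 1 + h(nxt), g + 1, nxt))
    return []   # unreachable; excluded by construction on navigability-guaranteed maps

def planner_subgoal(grid, robot_cell, goal_cell, lookahead=5):
    """The policy observes this near-term path cell instead of only the final goal."""
    path = astar(grid, robot_cell, goal_cell)
    return path[min(lookahead, len(path) - 1)] if path else goal_cell
```

The classical Carrot+A* baseline tracks a similar lookahead point (the "carrot") on an A* path, but with a fixed control law; per the abstract, it matches the learned policies only at 1.0 m/s and collapses to 24.9% success at 2.0 m/s.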