CReF: Cross-modal and Recurrent Fusion for Depth-conditioned Humanoid Locomotion

arXiv cs.RO / 4/1/2026


Key Points

  • The paper introduces CReF (Cross-modal and Recurrent Fusion), a single-stage depth-conditioned humanoid locomotion framework that learns directly from raw forward-facing depth rather than relying on explicit geometric intermediates like 2.5D terrain representations or auxiliary depth targets.
  • CReF uses proprioception-queried cross-modal attention to fuse proprioceptive and depth tokens, followed by a gated residual fusion block that combines the attended and proprioceptive representations (see the encoder sketch after this list).
  • Temporal context is integrated via a GRU with a highway-style output gate that adaptively blends recurrent and feedforward features depending on the current state.
  • To improve real-world terrain interaction, the method adds a terrain-aware foothold placement reward that uses foot-end point clouds to generate supportable foothold candidates and rewards touchdowns near the nearest feasible candidate (see the reward sketch below).
  • Experiments report robust traversal both in simulation and on a physical humanoid, including zero-shot transfer to real scenes with handrails, hollow structures, reflective interference, and visually cluttered outdoor environments.
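
The fusion mechanisms and the recurrent gate in the key points above can be made concrete with a short PyTorch sketch. Everything here is an illustrative assumption rather than the authors' implementation: the module names, token and hidden dimensions, the single-query design, and the exact gate wiring are guesses consistent with the summary, not the paper's code.

```python
import torch
import torch.nn as nn


class CrossModalRecurrentEncoder(nn.Module):
    """Illustrative CReF-style encoder: proprioception-queried cross-modal
    attention, gated residual fusion, and a GRU with a highway-style gate."""

    def __init__(self, proprio_dim=48, token_dim=128, hidden_dim=256, num_heads=4):
        super().__init__()
        # Proprioception becomes a single query token (assumed design).
        self.proprio_proj = nn.Linear(proprio_dim, token_dim)
        # Cross-modal attention: proprioception queries the depth tokens.
        self.cross_attn = nn.MultiheadAttention(token_dim, num_heads, batch_first=True)
        # Gated residual fusion: a sigmoid gate scales the attended depth
        # features before they are added back to the proprioceptive token.
        self.fuse_gate = nn.Sequential(nn.Linear(2 * token_dim, token_dim), nn.Sigmoid())
        self.fuse_proj = nn.Linear(token_dim, token_dim)
        # Temporal integration over control steps.
        self.gru = nn.GRUCell(token_dim, hidden_dim)
        # Highway-style output gate: state-dependent blend of recurrent
        # and feedforward features.
        self.ff = nn.Linear(token_dim, hidden_dim)
        self.out_gate = nn.Sequential(
            nn.Linear(token_dim + hidden_dim, hidden_dim), nn.Sigmoid()
        )

    def forward(self, proprio, depth_tokens, h):
        # proprio: (B, proprio_dim); depth_tokens: (B, N, token_dim); h: (B, hidden_dim)
        q = self.proprio_proj(proprio)                             # (B, token_dim)
        attended, _ = self.cross_attn(q.unsqueeze(1), depth_tokens, depth_tokens)
        attended = attended.squeeze(1)                             # (B, token_dim)
        gate = self.fuse_gate(torch.cat([q, attended], dim=-1))
        fused = q + gate * self.fuse_proj(attended)                # gated residual fusion
        h = self.gru(fused, h)                                     # recurrent update
        t = self.out_gate(torch.cat([fused, h], dim=-1))
        out = t * h + (1.0 - t) * self.ff(fused)                   # highway-style blend
        return out, h
```

At each control step the policy head would consume `out` while `h` carries over to the next step; the depth tokens themselves would come from some patch embedding of the raw depth image, which is not shown here.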

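The terrain-aware foothold placement reward can be sketched in the same spirit. Below is a minimal NumPy version, assuming a hypothetical candidate test (points near the local height maximum under the foot) and Gaussian shaping; `height_tol` and `sigma` are illustrative parameters, not values from the paper:

```python
import numpy as np


def foothold_reward(foot_xy, foot_points, height_tol=0.02, sigma=0.05):
    """Reward touchdowns that land near the nearest supportable candidate.

    foot_xy:     (2,) planar touchdown position of the foot.
    foot_points: (N, 3) point-cloud samples around the foot end.
    """
    if foot_points.shape[0] == 0:
        return 0.0  # no point-cloud samples under the foot
    # Assumed candidate test: samples whose height lies within height_tol
    # of the local maximum are treated as supportable foothold candidates.
    z = foot_points[:, 2]
    candidates = foot_points[np.abs(z - z.max()) < height_tol]
    if candidates.shape[0] == 0:
        return 0.0  # nothing supportable under the foot
    # Planar distance from the touchdown point to the nearest candidate.
    d = np.linalg.norm(candidates[:, :2] - foot_xy, axis=1).min()
    # Gaussian shaping: the reward peaks when the touchdown hits a candidate.
    return float(np.exp(-(d / sigma) ** 2))
```
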
Abstract

Stable traversal over geometrically complex terrain increasingly requires exteroceptive perception, yet prior perceptive humanoid locomotion methods often remain tied to explicit geometric abstractions, either by mediating control through robot-centric 2.5D terrain representations or by shaping depth learning with auxiliary geometry-related targets. Such designs inherit the representational bias of the intermediate or supervisory target and can be restrictive for vertical structures, perforated obstacles, and complex real-world clutter. We propose CReF (Cross-modal and Recurrent Fusion), a single-stage depth-conditioned humanoid locomotion framework that learns locomotion-relevant features directly from raw forward-facing depth without explicit geometric intermediates. CReF couples proprioception and depth tokens through proprioception-queried cross-modal attention, fuses the resulting representation with a gated residual fusion block, and performs temporal integration with a Gated Recurrent Unit (GRU) regulated by a highway-style output gate for state-dependent blending of recurrent and feedforward features. To further improve terrain interaction, we introduce a terrain-aware foothold placement reward that extracts supportable foothold candidates from foot-end point-cloud samples and rewards touchdown locations that lie close to the nearest supportable candidate. Experiments in simulation and on a physical humanoid demonstrate robust traversal over diverse terrains and effective zero-shot transfer to real-world scenes containing handrails, hollow pallet assemblies, severe reflective interference, and visually cluttered outdoor surroundings.