CReF: Cross-modal and Recurrent Fusion for Depth-conditioned Humanoid Locomotion
arXiv cs.RO / 4/1/2026
Key Points
- The paper introduces CReF (Cross-modal and Recurrent Fusion), a single-stage depth-conditioned humanoid locomotion framework that learns directly from raw forward-facing depth rather than relying on explicit geometric intermediates like 2.5D terrain representations or auxiliary depth targets.
- CReF fuses the two modalities with proprioception-queried cross-modal attention, in which proprioceptive features query the depth tokens; a gated residual fusion block then blends the attended depth context back into the proprioceptive stream.
- Temporal behavior is integrated via a GRU with a highway-style output gate that adaptively blends recurrent state features with feedforward features depending on the robot’s situation.
- To improve real-world terrain interaction, the method adds a terrain-aware foothold placement reward that uses foot-end point clouds to generate supportable foothold candidates and rewards touchdowns near the nearest feasible candidate.
- Experiments report robust traversal in both simulation and on a physical humanoid, including zero-shot transfer to real scenes with handrails, hollow structures, reflective interference, and visually cluttered outdoor environments.
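The proprioception-queried attention and gated residual fusion described above can be sketched as follows. This is a minimal single-query, single-head NumPy illustration, not the paper's implementation; all weight names (`Wp`, `Wk`, `Wv`, `Wg`) and dimensions are assumptions for the sketch.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cross_modal_fusion(proprio, depth_tokens, Wp, Wk, Wv, Wg):
    """Proprioception-queried cross-attention over depth tokens,
    followed by a gated residual fusion (hypothetical weight shapes)."""
    d = Wp.shape[1]
    p_emb = proprio @ Wp                      # (d,) proprio embedding = query
    k = depth_tokens @ Wk                     # (T, d) depth keys
    v = depth_tokens @ Wv                     # (T, d) depth values
    attn = softmax(p_emb @ k.T / np.sqrt(d))  # (T,) attention over depth tokens
    ctx = attn @ v                            # (d,) attended depth context
    # Gated residual fusion: a sigmoid gate decides, per dimension,
    # how much attended depth context to add back to the proprio stream.
    g = sigmoid(np.concatenate([p_emb, ctx]) @ Wg)
    return p_emb + g * ctx
```

Keeping proprioception as the query means the depth stream is only read where the robot's body state makes it relevant, rather than the policy consuming the full depth image directly.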
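The GRU-plus-highway-gate idea can also be sketched. Below, `gru_step` is a standard GRU cell (biases omitted for brevity) and `highway_output` shows one plausible form of the highway-style blend the paper describes: a sigmoid gate, computed from both streams, interpolates per dimension between recurrent-state features and feedforward features. The exact gate parameterization in CReF is not specified here and `Wt` is an assumption.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """Standard GRU cell update (no bias terms, for brevity)."""
    z = sigmoid(x @ Wz + h @ Uz)              # update gate
    r = sigmoid(x @ Wr + h @ Ur)              # reset gate
    h_tilde = np.tanh(x @ Wh + (r * h) @ Uh)  # candidate state
    return (1 - z) * h + z * h_tilde

def highway_output(h, f, Wt):
    """Highway-style output gate (assumed form): per-dimension sigmoid
    gate t blends recurrent features h with feedforward features f."""
    t = sigmoid(np.concatenate([h, f]) @ Wt)
    return t * h + (1 - t) * f
```

Because the gate is input-dependent, the policy can lean on memory (e.g. while a stair edge is below the camera's field of view) and fall back to reactive feedforward features when current observations suffice.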
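The terrain-aware foothold reward can likewise be sketched. This hypothetical version filters a foot-end point cloud down to locally flat (supportable) candidates, then rewards touchdowns by their distance to the nearest candidate with an exponential kernel; the candidate criterion, kernel shape, and thresholds (`flat_tol`, `radius`, `sigma`) are illustrative assumptions, not the paper's exact shaping.

```python
import numpy as np

def foothold_candidates(points, flat_tol=0.02, radius=0.05):
    """Keep points whose local neighborhood has small height spread,
    a simple proxy for 'supportable' (assumed criterion)."""
    candidates = []
    for p in points:
        near = points[np.linalg.norm(points[:, :2] - p[:2], axis=1) < radius]
        if np.ptp(near[:, 2]) < flat_tol:   # peak-to-peak height in patch
            candidates.append(p)
    return np.array(candidates)

def foothold_reward(touchdown, candidates, sigma=0.05):
    """Reward touchdowns near the nearest feasible candidate
    (assumed exponential distance kernel)."""
    if len(candidates) == 0:
        return 0.0
    d = np.linalg.norm(candidates - touchdown, axis=1).min()
    return float(np.exp(-d / sigma))
```

A touchdown exactly on a candidate scores 1.0 and the reward decays smoothly with distance, giving the policy a dense signal to place feet on supportable geometry rather than edges or gaps.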