Visually-grounded Humanoid Agents
arXiv cs.RO / 4/10/2026
Key Points
- The paper proposes “Visually-grounded Humanoid Agents,” a framework that lets digital humans act autonomously in novel 3D scenes from visual observations and specified goals alone, rather than relying on scripted control or privileged state.
- It introduces a two-layer world-agent framework: a World Layer that reconstructs semantically rich 3D Gaussian scenes from real-world videos and supports animatable Gaussian-based human avatars, paired with an Agent Layer for autonomous humanoid control.
- The Agent Layer equips avatars with first-person RGB-D perception for embodied planning with spatial awareness and iterative reasoning; the resulting plans are executed through low-level full-body actions.
- The authors also release a benchmark for evaluating humanoid-scene interaction across diverse reconstructed environments.
- Experiments report more robust autonomous behavior, with higher task success rates and fewer collisions than ablations and competing planning methods, and the authors plan to open-source data, code, and models.
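The two-layer design described above amounts to a perception-plan-act loop: the World Layer renders what the avatar sees from its current pose, and the Agent Layer turns that egocentric observation into low-level actions. A minimal sketch of that loop, in simplified 1-D form, might look like the following. Note this is purely illustrative: all class names, method signatures, and the toy distance-based "depth" signal are assumptions for exposition, not the paper's actual interface.

```python
# Illustrative sketch (NOT the paper's code): a minimal perception-plan-act
# loop in the spirit of the two-layer world-agent framework. Every name here
# is a hypothetical stand-in.
from dataclasses import dataclass


@dataclass
class Observation:
    rgb: list    # placeholder for a first-person RGB frame
    depth: list  # placeholder for the aligned depth map


class WorldLayer:
    """Stands in for the reconstructed 3D Gaussian scene: renders what the
    avatar would observe from its current pose."""

    def __init__(self, goal_position: float):
        self.goal_position = goal_position

    def render(self, agent_position: float) -> Observation:
        # A real system would rasterize Gaussians into an egocentric RGB-D
        # frame; here we fake a depth cue encoding distance to the goal.
        distance = abs(self.goal_position - agent_position)
        return Observation(rgb=[0.0], depth=[float(distance)])


class AgentLayer:
    """Stands in for embodied planning: reads egocentric RGB-D and emits a
    low-level action (here, a 1-D step toward the goal)."""

    def plan(self, obs: Observation, position: float, goal: float) -> int:
        if obs.depth[0] < 0.5:  # close enough to the goal: stop
            return 0
        return 1 if goal > position else -1


def run_episode(start: float = 0, goal: float = 5, max_steps: int = 20) -> float:
    world, agent, position = WorldLayer(goal), AgentLayer(), start
    for _ in range(max_steps):
        obs = world.render(position)               # first-person perception
        action = agent.plan(obs, position, goal)   # iterative reasoning step
        if action == 0:
            break
        position += action                         # low-level action execution
    return position


print(run_episode())  # walks from 0 toward the goal at 5
```

The point of the sketch is the closed loop: the agent never reads privileged world state, only what the World Layer renders for it each step, which mirrors the paper's stated constraint of acting from visual observations alone.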