ECHO: Edge-Cloud Humanoid Orchestration for Language-to-Motion Control
arXiv cs.CV · March 18, 2026
Key Points
- The paper presents ECHO, an edge-cloud framework for language-driven whole-body control of humanoid robots, which links a cloud-side diffusion-based text-to-motion generator with an edge-side RL tracking policy in a closed loop.
- Motions are encoded in a compact 38-dimensional representation and generated by a 1D UNet with cross-attention over CLIP text features, enabling fast inference (about one second on a cloud GPU with 10 denoising steps).
- The edge tracker follows a teacher-student paradigm, with sim-to-real transfer supported by an evidential adaptation module, domain randomization, and symmetry constraints, plus autonomous fall recovery driven by the onboard IMU and a library of recovery trajectories.
- Evaluations on HumanML3D show strong generation quality (FID 0.029, R-Precision Top-1 0.686), while real-world tests on a Unitree G1 demonstrate stable command execution without hardware fine-tuning.
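The generation bullet above describes conditioning a 1D UNet on CLIP text features via cross-attention. As a rough illustration only (this is not the authors' code; the token counts, single-head attention, and use of the 38-dimensional motion features directly as tokens are all assumptions for the sketch), one cross-attention step in which motion tokens attend to text tokens might look like:

```python
import numpy as np

def cross_attention(motion, text, Wq, Wk, Wv):
    """Single-head cross-attention: motion tokens (queries) attend to
    text tokens (keys/values), as when conditioning a denoiser on CLIP
    features. motion: (T, d), text: (L, d), W*: (d, d) projections."""
    q = motion @ Wq                                 # (T, d) queries from motion
    k = text @ Wk                                   # (L, d) keys from text
    v = text @ Wv                                   # (L, d) values from text
    scores = q @ k.T / np.sqrt(q.shape[-1])         # (T, L) scaled dot products
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over text tokens
    return weights @ v                              # (T, d) text-conditioned features

rng = np.random.default_rng(0)
d = 38                                    # compact motion feature dim (from the paper)
T, L = 16, 8                              # assumed motion/text token counts
motion = rng.standard_normal((T, d))      # noisy motion tokens at one denoising step
text = rng.standard_normal((L, d))        # stand-in for projected CLIP text features
Wq, Wk, Wv = (0.1 * rng.standard_normal((d, d)) for _ in range(3))
out = cross_attention(motion, text, Wq, Wk, Wv)
print(out.shape)  # (16, 38)
```

In the full model this block would sit inside the UNet's residual stages and run once per denoising step, so with 10 steps the roughly one-second cloud latency quoted above is plausible.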
