SigLoMa: Learning Open-World Quadrupedal Loco-Manipulation from Ego-Centric Vision

arXiv cs.RO / 5/6/2026


Key Points

  • The paper introduces SigLoMa, a fully onboard ego-centric vision system for open-world quadrupedal loco-manipulation (pick-and-place), aiming to remove reliance on external motion capture and off-board compute.
  • It addresses key limitations of traditional exteroception-based RL—sample inefficiency and sim-to-real gaps—by using “Sigma Points,” a lightweight geometric representation that supports scalable exteroception and native sim-to-real alignment.
  • To reconcile slow visual perception with fast floating-base control, SigLoMa uses an ego-centric Kalman filter for robust high-rate state estimation.
  • The learning approach improves efficiency and robustness through an Active Sampling Curriculum guided by Hint Poses, and it mitigates structural visual blind spots via temporal encoding plus simulated random-walk drift.
  • Real-world experiments show that using only a 5Hz (200 ms latency) open-vocabulary detector, SigLoMa can perform dynamic loco-manipulation across multiple tasks with results comparable to expert human teleoperation.
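The paper's ego-centric Kalman filter is not spelled out in this summary, but the general pattern it relies on, predicting the target state at the control rate while folding in slow, delayed detections whenever they arrive, can be illustrated with a toy 1-D constant-velocity filter. Everything below (state layout, rates, noise values, class name) is an assumption for the sketch, not the paper's implementation:

```python
import numpy as np

class AsyncKalmanFilter:
    """Toy constant-velocity Kalman filter: predict at a high control
    rate, correct whenever a slow detector measurement arrives."""

    def __init__(self, q=1e-3, r=1e-2):
        self.x = np.zeros(2)          # state: [position, velocity]
        self.P = np.eye(2)            # state covariance
        self.q, self.r = q, r         # process / measurement noise

    def predict(self, dt):
        F = np.array([[1.0, dt], [0.0, 1.0]])
        self.x = F @ self.x
        self.P = F @ self.P @ F.T + self.q * np.eye(2)

    def correct(self, z):
        H = np.array([[1.0, 0.0]])    # detector observes position only
        S = H @ self.P @ H.T + self.r # innovation covariance (1x1)
        K = self.P @ H.T / S          # Kalman gain
        self.x = self.x + (K * (z - H @ self.x)).ravel()
        self.P = (np.eye(2) - K @ H) @ self.P

# 100 Hz prediction loop with a 5 Hz detector: 20 predicts per correction.
kf = AsyncKalmanFilter()
true_v = 0.5                           # target moving at 0.5 m/s
for step in range(200):                # 2 s of control at 100 Hz
    kf.predict(dt=0.01)
    if step % 20 == 19:                # one detection every 200 ms
        z = true_v * (step + 1) * 0.01 # noiseless measurement for the demo
        kf.correct(np.array([z]))
print(kf.x)                            # estimate approaches [1.0, 0.5]
```

The key property is that the control loop never waits on perception: between detections the state is extrapolated at full rate, and each late-arriving measurement only nudges the estimate.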

Abstract

Designing an open-world quadrupedal loco-manipulation system is highly challenging. Traditional reinforcement learning frameworks utilizing exteroception often suffer from extreme sample inefficiency and massive sim-to-real gaps. Furthermore, the inherent latency of visual tracking fundamentally conflicts with the high-frequency demands of precise floating-base control. Consequently, existing systems lean heavily on expensive external motion capture and off-board computation. To eliminate these dependencies, we present SigLoMa, a fully onboard, ego-centric vision-based pick-and-place framework. At the core of SigLoMa is the introduction of Sigma Points, a lightweight geometric representation for exteroception that guarantees high scalability and native sim-to-real alignment. To bridge the frequency divide between slow perception and fast control, we design an ego-centric Kalman Filter to provide robust, high-rate state estimation. On the learning front, we alleviate sample inefficiency via an Active Sampling Curriculum guided by Hint Poses, and tackle the robot's structural visual blind spots using temporal encoding coupled with simulated random-walk drift. Real-world experiments validate that, relying solely on a 5Hz (200 ms latency) open-vocabulary detector, SigLoMa successfully executes dynamic loco-manipulation across multiple tasks, achieving performance comparable to expert human teleoperation.
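The abstract mentions training against "simulated random-walk drift" to handle structural visual blind spots. One plausible reading, sketched below under assumed names and magnitudes (the paper's exact scheme is not given here), is that during simulation a target estimate that falls out of view is perturbed by accumulated Gaussian steps, so the policy learns to act on stale, slowly drifting state:

```python
import numpy as np

rng = np.random.default_rng(0)

def drift_target(pos, n_steps, sigma=0.005):
    """Apply a Gaussian random walk to a target position estimate,
    mimicking perception drift while the target is in a blind spot.
    `sigma` (per-step standard deviation, meters) is a guessed value."""
    steps = rng.normal(0.0, sigma, size=(n_steps, pos.shape[0]))
    return pos + np.cumsum(steps, axis=0)  # drifting estimate trajectory

# Drift a 3-D object position over 50 simulation steps of occlusion.
traj = drift_target(np.array([0.4, 0.0, 0.2]), n_steps=50)
print(traj.shape)  # (50, 3)
```

Exposing the policy to this kind of corrupted estimate in simulation is a standard domain-randomization device; combined with temporal encoding of past observations, it would let the controller tolerate the moments when the ego-centric camera cannot see the object.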
