I built a complete vision system for humanoid robots

Reddit r/artificial / 4/1/2026


Key Points

  • The author released an open-source vision system for humanoid robots that runs locally on an NVIDIA Jetson Orin Nano with ROS2 integration.
  • The system enables real-time scene understanding such as estimating object distance, detecting people's positions, identifying gestures like waving, and supporting person-following based on an open-palm “owner” cue.
  • It combines multiple computer-vision components including YOLO11n for object detection, MiDaS for depth estimation, and MediaPipe for face detection, hand gestures, and pose estimation, plus tracking.
  • Reported performance ranges from 10–15 FPS by default to 30–40 FPS using INT8 optimization, aiming for a ~30 FPS real-time target.
  • The project emphasizes an “edge-first” and “privacy-first” philosophy (no data leaving the device) and provides quick-start instructions and a GitHub repo for community feedback.

I'm excited to share an open-source vision system I've been building for humanoid robots. It runs entirely on an NVIDIA Jetson Orin Nano with full ROS2 integration.

The Problem

Every day, millions of robots are deployed to help humans. But most of them are blind. Or dependent on cloud services that fail. Or so expensive only big companies can afford them.

I wanted to change that.

What OpenEyes Does

The robot looks at a room and understands:

- "There's a cup on the table, 40cm away"

- "A person is standing to my left"

- "They're waving at me - that's a greeting"

- "The person is sitting down - they might need help"

What's Inside

- Object Detection (YOLO11n)

- Depth Estimation (MiDaS)

- Face Detection (MediaPipe)

- Gesture Recognition (MediaPipe Hands)

- Pose Estimation (MediaPipe Pose)

- Object Tracking

- Person Following (show open palm to become owner)
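Combining the detector's boxes with the depth map is what produces statements like "there's a cup 40cm away". A minimal sketch of that fusion step, assuming a MiDaS-style relative inverse-depth map and a YOLO-style pixel bbox (the `scale_cm` calibration constant is hypothetical; MiDaS depth is relative, so a real system must calibrate it per camera):

```python
import numpy as np

def object_distance_cm(depth_map, bbox, scale_cm=100.0):
    """Estimate an object's distance from a relative inverse-depth map.

    depth_map: 2-D array of relative inverse depth (MiDaS-style).
    bbox: (x1, y1, x2, y2) pixel box from the detector.
    scale_cm: hypothetical calibration constant mapping relative
              depth to centimeters.
    """
    x1, y1, x2, y2 = bbox
    patch = depth_map[y1:y2, x1:x2]
    if patch.size == 0:
        return None
    # Median is robust to background pixels caught inside the box.
    inv_depth = float(np.median(patch))
    if inv_depth <= 0:
        return None
    # Larger inverse depth means closer to the camera.
    return scale_cm / inv_depth
```

The median over the box is a common trick: the detector's rectangle always includes some background, and a mean would let those far-away pixels drag the estimate off.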

Performance

- All models enabled: 10-15 FPS

- Minimal pipeline: 25-30 FPS

- Optimized (INT8): 30-40 FPS
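When comparing configurations like these, instantaneous frame times are too noisy to be useful; a rolling average over the last N frames is the usual fix. A small sketch of such a meter (not from the repo, just a generic pattern):

```python
import time
from collections import deque

class FpsMeter:
    """Rolling FPS over the last `window` frames -- useful for
    comparing the all-models, minimal, and INT8 configurations."""

    def __init__(self, window=30):
        self.stamps = deque(maxlen=window)

    def tick(self):
        # Call once per processed frame.
        self.stamps.append(time.perf_counter())

    def fps(self):
        if len(self.stamps) < 2:
            return 0.0
        span = self.stamps[-1] - self.stamps[0]
        # N stamps bound N-1 frame intervals.
        return (len(self.stamps) - 1) / span if span > 0 else 0.0
```

Calling `tick()` at the top of the main loop and printing `fps()` every second gives a stable number even when individual stages (depth, pose) have bursty latency.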

Philosophy

- Edge First - All processing on the robot

- Privacy First - No data leaves the device

- Real-time - 30 FPS target

- Open - Built by community, for community

Quick Start

git clone https://github.com/mandarwagh9/openeyes.git

cd openeyes

pip install -r requirements.txt

python src/main.py --debug

python src/main.py --follow   # person following

python src/main.py --ros2     # ROS2 integration
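The `--follow` mode's open-palm "owner" cue can be approximated with a simple rule over MediaPipe's 21 hand landmarks: a finger counts as extended when its tip sits above its PIP joint in image coordinates. This is a simplified heuristic under that assumption, not necessarily the project's actual rule:

```python
# MediaPipe hand landmark indices: (fingertip, PIP joint) for
# index, middle, ring, and pinky fingers.
FINGERS = [(8, 6), (12, 10), (16, 14), (20, 18)]

def is_open_palm(landmarks):
    """Rough open-palm check over MediaPipe's 21 hand landmarks.

    landmarks: list of (x, y) normalized coordinates; y grows downward,
    so a tip with smaller y than its PIP joint is raised. An upright
    open palm has all four fingers extended.
    """
    if len(landmarks) != 21:
        return False
    return all(landmarks[tip][1] < landmarks[pip][1]
               for tip, pip in FINGERS)
```

A real implementation would also debounce this over several frames so a momentary hand pose doesn't reassign the owner.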

The Journey

Started with a simple question: Why can't robots see like we do?

Been iterating for months fixing issues like:

- MediaPipe detection at high resolution

- Person following using bbox height ratio

- Gesture-based owner selection
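The bbox-height-ratio idea is that the tracked person's box height relative to the frame is a cheap proxy for distance: small ratio means far (drive forward), large ratio means too close (back up). A minimal proportional controller along those lines, with illustrative constants rather than the project's tuned ones:

```python
def follow_command(bbox_h, frame_h, target_ratio=0.5,
                   gain=1.0, deadband=0.05):
    """Forward-velocity command from the person's bbox height.

    Returns a value in [-1, 1]; positive means drive forward.
    target_ratio, gain, and deadband are illustrative defaults.
    """
    ratio = bbox_h / frame_h
    error = target_ratio - ratio   # positive when the person is far
    if abs(error) < deadband:
        return 0.0                 # close enough: hold position
    return max(-1.0, min(1.0, gain * error))
```

The deadband keeps the robot from oscillating around the target distance as the detector's box jitters frame to frame.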

Would love feedback from the community!

GitHub: github.com/mandarwagh9/openeyes

submitted by /u/Straight_Stable_6095