This is part of a series on building a training-free home service robot using VLMs, RAG memory, and manifold learning. This post covers the camera architecture — specifically, why fixed ceiling-mounted nodes ended up as the foundation of the whole perception system.
Honestly, my first instinct was: why not just use the robot's onboard camera?
It's the obvious answer. The robot is already in the room. It already has a camera. Adding twelve fixed ceiling nodes sounds like unnecessary complexity — more hardware, more calibration, more failure points, for a system that was already complicated enough.
My advisor's requirement was firm: the AI pipeline must take its visual input from fixed global cameras, not from the robot itself. No negotiation on that point.
So I spent a while sitting with that constraint, trying to understand it rather than just comply with it. This post is what I figured out. It starts with the genuine question I had — why global cameras at all? — and ends with the engineering decisions I made once I accepted the answer.
The Problem with a Robot's Point of View
The case for onboard vision is intuitive: the robot is mobile, so its camera goes wherever the action is. But "wherever the action is" turns out to be exactly the problem.
A robot-mounted camera, positioned anywhere from 30cm to 100cm off the ground, has two fundamental problems.
Narrow field of view and constant occlusion. The robot sees the world from a low, mobile, first-person perspective. A sofa blocks the view of the person sitting behind it. The kitchen wall hides what's happening at the dining table. From the robot's perspective, the home is a maze of partial information.
Motion blur during navigation. When the robot is moving — which is most of the time — its onboard camera is not producing reliable still frames. Activity recognition from blurry, unstabilized video is significantly harder than recognition from a fixed viewpoint.
These aren't engineering failures. They're intrinsic to the geometry of a mobile, ground-level camera. No amount of better hardware changes the fundamental constraint.
The Architecture: Fixed Nodes + a Single Moving Camera
My solution was to separate global perception from local action. The robot handles physical interaction. A set of fixed-position camera nodes handles scene understanding.
In my system, twelve CameraNode objects are distributed across three rooms (four per room) at ceiling height — approximately 2.3m — simulating the kind of fixed IP camera array you might mount in a real home. These nodes don't move, don't occlude each other, and always have a stable, overhead view of the space.
But here's the key engineering constraint: I only have one physical camera in the scene. Rather than instantiating twelve separate cameras (expensive and redundant), I use a single camera that teleports to each selected node position, renders a frame, and moves on. The VirtualCameraBrain component manages this process.
```text
for each selected node (top-N by score):
    camera.transform ← node.position + node.rotation
    wait 2 frames              // GPU render flush
    capture 512×512 PNG → Base64
POST all images to /predict in one request
```
This gives me multi-viewpoint coverage with minimal rendering overhead.
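On the receiving side, the backend gets one request carrying every selected frame. A minimal sketch of the decode step, assuming a payload shape like `{"images": [...]}` (the field name is my assumption, not necessarily what `/predict` actually expects):

```python
import base64

def decode_frames(payload: dict) -> list[bytes]:
    """Decode a batch of Base64-encoded PNG frames from one /predict request.

    Payload shape is a hypothetical sketch: {"images": ["<base64 PNG>", ...]}.
    """
    return [base64.b64decode(s) for s in payload.get("images", [])]

# Round-trip check with a stand-in frame (PNG magic bytes + padding).
fake_png = b"\x89PNG\r\n\x1a\n" + b"\x00" * 16
payload = {"images": [base64.b64encode(fake_png).decode("ascii")]}
frames = decode_frames(payload)
```

Batching both captures into one request means the VLM backend sees the alternate viewpoints together, rather than as two unrelated inference calls.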
The Harder Problem: Which Node Do You Pick?
Having twelve nodes is useless if you pick the wrong one. A node behind the user, or one with the user at the edge of its field of view, produces an image the VLM can't interpret reliably.
I needed a scoring function. Here's what I settled on:
s_i = (v_i × 0.5 + α_i × 0.3 + d_i × 0.2) × m_i
Where:
- v_i — Visibility: does a linecast from the node to the user's chest reach without hitting furniture? (0 or 1)
- α_i — Angle factor: how centered is the user in the node's field of view? (1 at dead center, 0 at the FOV edge)
- d_i — Distance factor: linear falloff (1 at 0m, 0 at 10m and beyond)
- m_i — A per-node priority multiplier, set in the Inspector
The most important rule comes before the weighted sum: hard FOV gating. If the user falls outside the node's field of view cone, that node gets score = 0 immediately, no matter how good its distance or linecast result. There's no point in a weighted calculation for a camera that can't even see the target.
After scoring, I sort candidates descending and capture from the top-2 nodes (configurable). Two viewpoints handle the cases where one node is slightly occluded.
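The scoring rule above is compact enough to sketch end to end. This is a plain-Python version with the hard FOV gate applied before the weighted sum; the vector geometry here is a stand-in for what the real system does inside Unity with `Transform` and linecasts:

```python
import math

def score_node(node_pos, node_forward, fov_deg, user_pos, visible, multiplier,
               max_dist=10.0):
    """s = (0.5*v + 0.3*a + 0.2*d) * m, with hard FOV gating first.

    node_forward must be a unit vector. `visible` stands in for the
    Unity linecast result; geometry helpers here are a sketch.
    """
    dx = [u - p for u, p in zip(user_pos, node_pos)]
    dist = math.sqrt(sum(c * c for c in dx))
    if dist == 0:
        return 0.0
    dir_to_user = [c / dist for c in dx]
    cos_angle = sum(f * d for f, d in zip(node_forward, dir_to_user))
    half_fov = math.radians(fov_deg / 2)
    # Hard FOV gate: outside the cone -> score 0, no weighted sum at all.
    if cos_angle < math.cos(half_fov):
        return 0.0
    v = 1.0 if visible else 0.0                      # linecast hit/miss
    angle = math.acos(max(-1.0, min(1.0, cos_angle)))
    a = 1.0 - angle / half_fov                       # 1 at center, 0 at edge
    d = max(0.0, 1.0 - dist / max_dist)              # linear falloff to 10 m
    return (0.5 * v + 0.3 * a + 0.2 * d) * multiplier

# Ceiling node at 2.3 m, tilted down, user roughly centered 3 m away:
s = score_node((0, 2.3, 0), (0, -0.6, 0.8), 90, (0, 0, 3.0), True, 1.0)  # ≈ 0.92
```

Sorting all twelve scores descending and slicing the top two reproduces the selection step; a node behind the user gets exactly 0 from the gate, so it can never win on distance alone.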
Why This Matters for the VLM
The whole reason I care about viewpoint quality is downstream accuracy. My system uses llava-phi3 (via Ollama) to recognize what the user is doing — drinking, sitting, reading, typing — without any task-specific training.
VLMs are sensitive to image quality in ways that trained classifiers aren't. A trained activity recognition model can learn to compensate for partial occlusion if it sees enough occluded examples during training. A zero-shot VLM cannot — it has to interpret what it sees without that learned correction.
This means camera selection directly controls recognition accuracy. In early testing, episodes where the scoring system chose a poor viewpoint (user at the edge of frame, or partially behind furniture) produced VLM outputs like "a person standing near a wall" instead of "a person drinking from a bottle." The SBERT normalization layer handled some of this, but the better fix was improving the viewpoint selection upstream.
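The normalization step mentioned above maps free-form VLM text to the nearest canonical label by similarity in embedding space. Here is a toy sketch of that matching logic, using a bag-of-words stand-in where the real system would call an SBERT model — the label strings are illustrative, not my actual canonical set:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in for SBERT: a bag-of-words vector. The real pipeline would
    # use a sentence-transformers model's encode() here instead.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

CANONICAL = ["a person drinking from a bottle", "a person reading a book",
             "a person typing on a laptop", "a person sitting on a sofa"]

def normalize(vlm_output: str) -> str:
    """Snap a free-form VLM description to the closest canonical label."""
    return max(CANONICAL, key=lambda c: cosine(embed(vlm_output), embed(c)))
```

The point of the matching layer is damage control: it absorbs phrasing drift in the VLM output, but it cannot recover a description that was wrong because the viewpoint was bad.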
The Gap Between Simulation and Reality
I want to be honest about where my current implementation sits. Everything described above runs in a Unity 3D simulation. The "nodes" are virtual GameObjects. The "camera" is Unity's rendering engine. The coordinate streams come from DynamicSyncManager.cs, not from depth sensors or object detection.
This is intentional — I'm using simulation to validate the framework before committing to physical hardware. But it means two real-world problems remain unsolved:
Extrinsic calibration. In a real deployment, every pixel (u, v) from each fixed node must be mapped to a shared 3D coordinate system. This requires physical calibration of each camera's position and orientation relative to the room — a process that takes significant setup time and re-calibration whenever a camera is moved.
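To make the calibration requirement concrete, here is the forward direction of that mapping — projecting a shared-world point into one node's image given its estimated pose. The intrinsics (`FX`, `CX`, ...) are made-up values for a 512×512 sensor; in a real deployment both the intrinsics and each node's `cam_pos`/`cam_rot` would come from a calibration procedure:

```python
# Hypothetical pinhole intrinsics for a 512x512 node (assumed values).
FX = FY = 400.0
CX = CY = 256.0

def world_to_pixel(p_world, cam_pos, cam_rot):
    """Project a world point into one node's image via its extrinsics.

    cam_rot is a 3x3 row-major rotation (world -> camera frame). Recovering
    pixel -> 3D is the inverse problem and needs depth or multi-view overlap.
    """
    t = [p - c for p, c in zip(p_world, cam_pos)]          # translate
    x, y, z = (sum(cam_rot[r][k] * t[k] for k in range(3)) # rotate
               for r in range(3))
    if z <= 0:
        return None  # point is behind the camera
    return (FX * x / z + CX, FY * y / z + CY)

IDENTITY = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
# A point straight ahead of the node lands at the image center:
uv = world_to_pixel((0, 2.3, 0), (0, 2.3, -3.0), IDENTITY)  # (256.0, 256.0)
```

Every pixel-to-position claim the fixed-node array makes runs through a transform like this, which is why a bumped camera silently corrupts the whole coordinate stream until re-calibration.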
Latency compensation. Network transmission from a wall-mounted IP camera to the processing backend introduces roughly 50–150ms of latency. For a moving user, this means the position data you receive corresponds to where they were, not where they are. You need prediction — either simple linear extrapolation or a Kalman filter — to compensate.
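The simpler of the two prediction options is a one-liner in practice. A sketch of linear extrapolation over two timestamped positions — this is the naive baseline, not a claim about what the final system will use:

```python
def extrapolate(p_prev, t_prev, p_curr, t_curr, latency_s):
    """Predict where the user is *now* from two stale timestamped positions.

    Estimates velocity from the last two samples and pushes the newest
    position forward by the measured latency. A Kalman filter would replace
    this in a noisier real deployment.
    """
    dt = t_curr - t_prev
    if dt <= 0:
        return list(p_curr)  # degenerate timestamps: trust the latest sample
    v = [(c - p) / dt for c, p in zip(p_curr, p_prev)]
    return [c + vi * latency_s for c, vi in zip(p_curr, v)]

# User walking 0.5 m/s along x, 100 ms of camera->backend latency:
pred = extrapolate([0.0, 0.0], 0.0, [0.5, 0.0], 1.0, 0.1)  # ≈ [0.55, 0.0]
```

At typical walking speed (~1.4 m/s), 150ms of uncompensated latency is about 20cm of position error — enough to matter for close-range robot approach.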
My simulation sidesteps both of these by giving me ground-truth coordinates directly from the Unity scene. That's a real gap, and I'm documenting it explicitly in the thesis as a limitation.
Cooperative Offloading: What This Enables for the Robot
The architectural payoff of this design is that the robot itself doesn't need to run heavy vision inference. The fixed-node perception pipeline handles scene understanding and transmits lightweight metadata to the robot's decision layer:
```json
{
  "user_id": "User_Mom",
  "action": "Reading",
  "pos": [-0.17, 8.62],
  "room": "LivingRoom",
  "intent": "Drink",
  "confidence": 0.74
}
```
The robot receives a pre-processed summary — who is where, what they're doing, and what they're likely to want next — rather than raw pixels. This is the core of the "ambient intelligence offloads to embodied intelligence" architecture.
For a battery-powered physical robot, this matters a lot. Running a VLM inference pipeline continuously on an embedded GPU drains a battery in under an hour. Running it on a wall-powered backend and sending metadata over Wi-Fi costs almost nothing on the robot side.
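On the robot side, consuming this metadata is cheap: parse, gate on confidence, act. A sketch of that consumer — the field names follow the JSON example above, but the decision rule itself is hypothetical, not the system's actual policy:

```python
import json

def handle_metadata(raw: str) -> str:
    """Turn one perception message into a robot action (hypothetical policy).

    Gates on confidence first, so low-quality perception never drives motion.
    """
    msg = json.loads(raw)
    if msg.get("confidence", 0.0) < 0.5:
        return "hold"  # don't act on low-confidence perception
    if msg.get("intent") == "Drink":
        x, y = msg["pos"]
        return f"navigate_to({x}, {y}) with water"
    return "idle"

example = ('{"user_id": "User_Mom", "action": "Reading", "pos": [-0.17, 8.62],'
           ' "room": "LivingRoom", "intent": "Drink", "confidence": 0.74}')
```

The asymmetry is the point: the backend burns GPU cycles on pixels, the robot burns microseconds on a dictionary lookup.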
Perception Mode Comparison
| Feature | Onboard Camera | Fixed Node Array | What My System Does |
|---|---|---|---|
| Field of view | Local, low, easily occluded | Global, overhead, stable | Nodes handle scene understanding |
| Inference load | Runs on robot battery | Runs on wall-powered backend | VLM runs on backend only |
| Coordinate source | Estimated from robot odometry | Direct from scene/sensors | Unity scene (simulation) |
| Calibration | Built into robot | Requires room-level setup | Skipped in simulation; needed in real deployment |
| Failure mode | Occlusion, motion blur | Network latency, fixed FOV gaps | Fallback: retry with next-best node |
What's Next
The next post in this series covers how I use these captured images as input to a zero-shot VLM pipeline, and how SBERT semantic normalization maps the free-form VLM descriptions to canonical behavior labels — without any training data.
If you're building something similar, or have dealt with the extrinsic calibration problem in a real deployment, I'd love to hear how you approached it.
Part of the "Training-Free Home Robot" series. The full system integrates VLM perception, a Behavioral Scene Graph memory layer (FAISS + MongoDB), and UMAP manifold learning for proactive intent prediction.