This is part of a series on building a training-free home service robot using VLMs, RAG memory, and manifold learning. This post covers the camera architecture — specifically, why fixed ceiling-mounted nodes ended up as the foundation of the whole perception system.
Honestly, my first instinct was: why not just use the robot's onboard camera?
It's the obvious answer. The robot is already in the room. It already has a camera. Adding twelve fixed ceiling nodes sounds like unnecessary complexity — more hardware, more calibration, more failure points, for a system that was already complicated enough.
My advisor's requirement was firm: the AI pipeline must take its visual input from fixed global cameras, not from the robot itself. No negotiation on that point.
So I spent a while sitting with that constraint, trying to understand it rather than just comply with it. This post is what I figured out. It starts with the genuine question I had — why global cameras at all? — and ends with the engineering decisions I made once I accepted the answer.
The Problem with a Robot's Point of View
The case for onboard vision is intuitive: the robot is mobile, so its camera goes wherever the action is. But "wherever the action is" turns out to be exactly the problem.
A robot-mounted camera, positioned anywhere from 30cm to 100cm off the ground, has two fundamental problems.
Narrow field of view and constant occlusion. The robot sees the world from a low, mobile, first-person perspective. A sofa blocks the view of the person sitting behind it. The kitchen wall hides what's happening at the dining table. From the robot's perspective, the home is a maze of partial information.
Motion blur during navigation. When the robot is moving — which is most of the time — its onboard camera is not producing reliable still frames. Activity recognition from blurry, unstabilized video is significantly harder than recognition from a fixed viewpoint.
These aren't engineering failures. They're intrinsic to the geometry of a mobile, ground-level camera. No amount of better hardware changes the fundamental constraint.
The Architecture: Fixed Nodes + a Single Moving Camera
My solution was to separate global perception from local action. The robot handles physical interaction. A set of fixed-position camera nodes handles scene understanding.
In my system, twelve CameraNode objects are distributed across three rooms (four per room) at ceiling height — approximately 2.3m — simulating the kind of fixed IP camera array you might mount in a real home. These nodes don't move, don't occlude each other, and always have a stable, overhead view of the space.
But here's the key engineering constraint: I only have one physical camera in the scene. Rather than instantiating twelve separate cameras (expensive and redundant), I use a single camera that teleports to each selected node position, renders a frame, and moves on. The VirtualCameraBrain component manages this process.
```text
for each selected node (top-N by score):
    camera.transform ← node.position + node.rotation
    wait 2 frames              // GPU render flush
    capture 512×512 PNG → Base64
POST all images to /predict in one request
```
This gives me multi-viewpoint coverage with minimal rendering overhead.
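On the receiving side, the backend gets one request carrying every selected frame. A minimal sketch of the decode step, assuming a payload shape like `{"images": [...]}` (the field name is my assumption, not necessarily what `/predict` actually expects):

```python
import base64

def decode_frames(payload: dict) -> list[bytes]:
    """Decode a batch of Base64-encoded PNG frames from one /predict request.

    Payload shape is a hypothetical sketch: {"images": ["<base64 PNG>", ...]}.
    """
    return [base64.b64decode(s) for s in payload.get("images", [])]

# Round-trip check with a stand-in frame (PNG magic bytes + padding).
fake_png = b"\x89PNG\r\n\x1a\n" + b"\x00" * 16
payload = {"images": [base64.b64encode(fake_png).decode("ascii")]}
frames = decode_frames(payload)
```

Batching both captures into one request means the VLM backend sees the alternate viewpoints together, rather than as two unrelated inference calls.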
The Harder Problem: Which Node Do You Pick?
Having twelve nodes is useless if you pick the wrong one. A node behind the user, or one with the user at the edge of its field of view, produces an image the VLM can't interpret reliably.
I needed a scoring function. Here's what I settled on:
s_i = (v_i × 0.5 + α_i × 0.3 + d_i × 0.2) × m_i
Where:
- v_i — Visibility: does a linecast from the node to the user's chest reach without hitting furniture? (0 or 1)
- α_i — Angle factor: how centered is the user in the node's field of view? (1 at dead center, 0 at the FOV edge)
- d_i — Distance factor: linear falloff (1 at 0m, 0 at 10m and beyond)
- m_i — A per-node priority multiplier, set in the Inspector
The most important rule comes before the weighted sum: hard FOV gating. If the user falls outside the node's field of view cone, that node gets score = 0 immediately, no matter how good its distance or linecast result. There's no point in a weighted calculation for a camera that can't even see the target.
After scoring, I sort candidates descending and capture from the top-2 nodes (configurable). Two viewpoints handle the cases where one node is slightly occluded.
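The scoring rule above is compact enough to sketch end to end. This is a plain-Python version with the hard FOV gate applied before the weighted sum; the vector geometry here is a stand-in for what the real system does inside Unity with `Transform` and linecasts:

```python
import math

def score_node(node_pos, node_forward, fov_deg, user_pos, visible, multiplier,
               max_dist=10.0):
    """s = (0.5*v + 0.3*a + 0.2*d) * m, with hard FOV gating first.

    node_forward must be a unit vector. `visible` stands in for the
    Unity linecast result; geometry helpers here are a sketch.
    """
    dx = [u - p for u, p in zip(user_pos, node_pos)]
    dist = math.sqrt(sum(c * c for c in dx))
    if dist == 0:
        return 0.0
    dir_to_user = [c / dist for c in dx]
    cos_angle = sum(f * d for f, d in zip(node_forward, dir_to_user))
    half_fov = math.radians(fov_deg / 2)
    # Hard FOV gate: outside the cone -> score 0, no weighted sum at all.
    if cos_angle < math.cos(half_fov):
        return 0.0
    v = 1.0 if visible else 0.0                      # linecast hit/miss
    angle = math.acos(max(-1.0, min(1.0, cos_angle)))
    a = 1.0 - angle / half_fov                       # 1 at center, 0 at edge
    d = max(0.0, 1.0 - dist / max_dist)              # linear falloff to 10 m
    return (0.5 * v + 0.3 * a + 0.2 * d) * multiplier

# Ceiling node at 2.3 m, tilted down, user roughly centered 3 m away:
s = score_node((0, 2.3, 0), (0, -0.6, 0.8), 90, (0, 0, 3.0), True, 1.0)  # ≈ 0.92
```

Sorting all twelve scores descending and slicing the top two reproduces the selection step; a node behind the user gets exactly 0 from the gate, so it can never win on distance alone.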
Why This Matters for the VLM
The whole reason I care about viewpoint quality is downstream accuracy. My system uses llava-phi3 (via Ollama) to recognize what the user is doing — drinking, sitting, reading, typing — without any task-specific training.
VLMs are sensitive to image quality in ways that trained classifiers aren't. A trained activity recognition model can learn to compensate for partial occlusion if it sees enough occluded examples during training. A zero-shot VLM cannot — it has to interpret what it sees without that learned correction.
This means camera selection directly controls recognition accuracy. In early testing, episodes where the scoring system chose a poor viewpoint (user at the edge of frame, or partially behind furniture) produced VLM outputs like "a person standing near a wall" instead of "a person drinking from a bottle." The SBERT normalization layer handled some of this, but the better fix was improving the viewpoint selection upstream.
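The normalization step mentioned above maps free-form VLM text to the nearest canonical label by similarity in embedding space. Here is a toy sketch of that matching logic, using a bag-of-words stand-in where the real system would call an SBERT model — the label strings are illustrative, not my actual canonical set:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in for SBERT: a bag-of-words vector. The real pipeline would
    # use a sentence-transformers model's encode() here instead.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

CANONICAL = ["a person drinking from a bottle", "a person reading a book",
             "a person typing on a laptop", "a person sitting on a sofa"]

def normalize(vlm_output: str) -> str:
    """Snap a free-form VLM description to the closest canonical label."""
    return max(CANONICAL, key=lambda c: cosine(embed(vlm_output), embed(c)))
```

The point of the matching layer is damage control: it absorbs phrasing drift in the VLM output, but it cannot recover a description that was wrong because the viewpoint was bad.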
The Gap Between Simulation and Reality
I want to be honest about where my current implementation sits. Everything described above runs in a Unity 3D simulation. The "nodes" are virtual GameObjects. The "camera" is Unity's rendering engine. The coordinate streams come from DynamicSyncManager.cs, not from depth sensors or object detection.
This is intentional — I'm using simulation to validate the framework before committing to physical hardware. But it means two real-world problems remain unsolved:
Extrinsic calibration. In a real deployment, every pixel (u, v) from each fixed node must be mapped to a shared 3D coordinate system. This requires physical calibration of each camera's position and orientation relative to the room — a process that takes significant setup time and re-calibration whenever a camera is moved.
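To make the calibration requirement concrete, here is the forward direction of that mapping — projecting a shared-world point into one node's image given its estimated pose. The intrinsics (`FX`, `CX`, ...) are made-up values for a 512×512 sensor; in a real deployment both the intrinsics and each node's `cam_pos`/`cam_rot` would come from a calibration procedure:

```python
# Hypothetical pinhole intrinsics for a 512x512 node (assumed values).
FX = FY = 400.0
CX = CY = 256.0

def world_to_pixel(p_world, cam_pos, cam_rot):
    """Project a world point into one node's image via its extrinsics.

    cam_rot is a 3x3 row-major rotation (world -> camera frame). Recovering
    pixel -> 3D is the inverse problem and needs depth or multi-view overlap.
    """
    t = [p - c for p, c in zip(p_world, cam_pos)]          # translate
    x, y, z = (sum(cam_rot[r][k] * t[k] for k in range(3)) # rotate
               for r in range(3))
    if z <= 0:
        return None  # point is behind the camera
    return (FX * x / z + CX, FY * y / z + CY)

IDENTITY = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
# A point straight ahead of the node lands at the image center:
uv = world_to_pixel((0, 2.3, 0), (0, 2.3, -3.0), IDENTITY)  # (256.0, 256.0)
```

Every pixel-to-position claim the fixed-node array makes runs through a transform like this, which is why a bumped camera silently corrupts the whole coordinate stream until re-calibration.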
Latency compensation. Network transmission from a wall-mounted IP camera to the processing backend introduces roughly 50–150ms of latency. For a moving user, this means the position data you receive corresponds to where they were, not where they are. You need prediction — either simple linear extrapolation or a Kalman filter — to compensate.
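The simpler of the two prediction options is a one-liner in practice. A sketch of linear extrapolation over two timestamped positions — this is the naive baseline, not a claim about what the final system will use:

```python
def extrapolate(p_prev, t_prev, p_curr, t_curr, latency_s):
    """Predict where the user is *now* from two stale timestamped positions.

    Estimates velocity from the last two samples and pushes the newest
    position forward by the measured latency. A Kalman filter would replace
    this in a noisier real deployment.
    """
    dt = t_curr - t_prev
    if dt <= 0:
        return list(p_curr)  # degenerate timestamps: trust the latest sample
    v = [(c - p) / dt for c, p in zip(p_curr, p_prev)]
    return [c + vi * latency_s for c, vi in zip(p_curr, v)]

# User walking 0.5 m/s along x, 100 ms of camera->backend latency:
pred = extrapolate([0.0, 0.0], 0.0, [0.5, 0.0], 1.0, 0.1)  # ≈ [0.55, 0.0]
```

At typical walking speed (~1.4 m/s), 150ms of uncompensated latency is about 20cm of position error — enough to matter for close-range robot approach.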
My simulation sidesteps both of these by giving me ground-truth coordinates directly from the Unity scene. That's a real gap, and I'm documenting it explicitly in the thesis as a limitation.
Cooperative Offloading: What This Enables for the Robot
The architectural payoff of this design is that the robot itself doesn't need to run heavy vision inference. The fixed-node perception pipeline handles scene understanding and transmits lightweight metadata to the robot's decision layer:
```json
{
  "user_id": "User_Mom",
  "action": "Reading",
  "pos": [-0.17, 8.62],
  "room": "LivingRoom",
  "intent": "Drink",
  "confidence": 0.74
}
```
The robot receives a pre-processed summary — who is where, what they're doing, and what they're likely to want next — rather than raw pixels. This is the core of the "ambient intelligence offloads to embodied intelligence" architecture.
For a battery-powered physical robot, this matters a lot. Running a VLM inference pipeline continuously on an embedded GPU drains a battery in under an hour. Running it on a wall-powered backend and sending metadata over Wi-Fi costs almost nothing on the robot side.
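On the robot side, consuming this metadata is cheap: parse, gate on confidence, act. A sketch of that consumer — the field names follow the JSON example above, but the decision rule itself is hypothetical, not the system's actual policy:

```python
import json

def handle_metadata(raw: str) -> str:
    """Turn one perception message into a robot action (hypothetical policy).

    Gates on confidence first, so low-quality perception never drives motion.
    """
    msg = json.loads(raw)
    if msg.get("confidence", 0.0) < 0.5:
        return "hold"  # don't act on low-confidence perception
    if msg.get("intent") == "Drink":
        x, y = msg["pos"]
        return f"navigate_to({x}, {y}) with water"
    return "idle"

example = ('{"user_id": "User_Mom", "action": "Reading", "pos": [-0.17, 8.62],'
           ' "room": "LivingRoom", "intent": "Drink", "confidence": 0.74}')
```

The asymmetry is the point: the backend burns GPU cycles on pixels, the robot burns microseconds on a dictionary lookup.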
Perception Mode Comparison
| Feature | Onboard Camera | Fixed Node Array | What My System Does |
|---|---|---|---|
| Field of view | Local, low, easily occluded | Global, overhead, stable | Nodes handle scene understanding |
| Inference load | Runs on robot battery | Runs on wall-powered backend | VLM runs on backend only |
| Coordinate source | Estimated from robot odometry | Direct from scene/sensors | Unity scene (simulation) |
| Calibration | Built into robot | Requires room-level setup | Skipped in simulation; needed in real deployment |
| Failure mode | Occlusion, motion blur | Network latency, fixed FOV gaps | Fallback: retry with next-best node |
What's Next
The next post in this series covers how I use these captured images as input to a zero-shot VLM pipeline, and how SBERT semantic normalization maps the free-form VLM descriptions to canonical behavior labels — without any training data.
If you're building something similar, or have dealt with the extrinsic calibration problem in a real deployment, I'd love to hear how you approached it.
Part of the "Training-Free Home Robot" series. The full system integrates VLM perception, a Behavioral Scene Graph memory layer (FAISS + MongoDB), and UMAP manifold learning for proactive intent prediction.