How I Designed a Camera Scoring System for VLM-Based Activity Recognition — and Why It Looks Different in the Real World

Dev.to / 3/31/2026

💬 Opinion

Key Points

  • The post explains how a training-free home-robot activity recognition system uses a zero-shot vision-language model (VLM), making viewpoint selection crucial because VLM accuracy depends directly on image quality.

Part 2 of the "Training-Free Home Robot" series. Part 1 covered why fixed ceiling-mounted nodes ended up as the perception foundation. This post goes deep on one specific algorithm: how the system decides which camera angle to use for each behavioral episode, and what that decision looks like when you leave the simulation.

Once I'd worked through why fixed global cameras made sense — a conclusion I reached the hard way, starting from genuine skepticism about the requirement — the next problem was entirely mine: given twelve candidate viewpoints, which one do you actually use?

My advisor specified the input modality. The selection algorithm, the scoring weights, the hard FOV gate, the fallback logic — none of that was given to me. This post is that design work: where each decision came from, what tradeoffs it makes, and what changes when you move from a Unity simulation to a real room.

The Core Problem

My system recognizes what a user is doing — drinking, reading, typing — by sending a camera image to a Vision-Language Model (VLM). The VLM is zero-shot: no training data, no fine-tuning. It just sees an image and describes what's happening.

This creates a hard dependency: VLM accuracy is directly tied to image quality, and image quality is directly tied to viewpoint selection.

A trained activity recognition model can partially compensate for bad viewpoints — it has seen thousands of occluded or off-angle examples during training. A zero-shot VLM cannot. If the user is at the edge of the frame, or partially behind furniture, the VLM produces unreliable output: "a person standing near a wall" instead of "a person drinking from a bottle."

So before any AI inference happens, the system needs to answer: which of the twelve available camera nodes will produce the most useful image right now?

Why Not Just Pick the Closest Node?

The naive approach is distance-only: pick the node closest to the user. But distance alone misses two critical failure modes.

Occlusion. A node 1.5m away from the user, directly behind a sofa, produces a completely blocked image. A node 4m away with a clear line of sight is far more useful.

Off-axis angle. A node positioned to the side of a user who is facing a desk will capture a profile view at best, and the back of the user's head at worst. VLMs strongly prefer frontal or near-frontal views for activity recognition — they're trained on internet images where people face the camera.

Distance matters, but it's one factor among three.

The Scoring Formula

I ended up with a weighted combination of three geometric factors, plus a hard gate that runs before any of them.

Step 0 — Hard FOV Gate

Before computing any score, I check whether the user even falls within the node's field of view cone. If not, the node is excluded immediately — score = 0, no further calculation.

if θ_i > FOV_i / 2  →  s_i = 0   (hard gate, skip remaining calculation)

where θ_i is the angle between the node's forward direction and the vector pointing toward the user's chest. The aim point is set at chest height: aim = user.position + (0, 1.2, 0).

This gate matters more than it might seem. Without it, the weighted formula can assign a non-zero score to a node that physically cannot see the user — it just happens to be close or have good visibility in a different direction. Hard gating eliminates this entire class of bad selections before the arithmetic starts.
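As a concrete sketch, the gate reduces to one angle comparison. This is an illustrative Python version (names like `fov_gated` are mine, not from the thesis), assuming a Y-up coordinate system as in Unity:

```python
import numpy as np

def fov_gated(node_pos, node_forward, node_fov_deg, user_pos):
    """Return True if the user's chest falls inside the node's FOV cone.

    The aim point is chest height (user position + 1.2 m on the up axis),
    matching the formula in the post.
    """
    aim = np.asarray(user_pos, dtype=float) + np.array([0.0, 1.2, 0.0])
    to_user = aim - np.asarray(node_pos, dtype=float)
    to_user /= np.linalg.norm(to_user)
    fwd = np.asarray(node_forward, dtype=float)
    fwd /= np.linalg.norm(fwd)
    # θ_i: angle between the node's forward direction and the chest vector
    theta = np.degrees(np.arccos(np.clip(np.dot(fwd, to_user), -1.0, 1.0)))
    return theta <= node_fov_deg / 2
```

A node looking straight at the user passes; a user 90° off a node with a 90° FOV fails, regardless of distance.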

Step 1 — Visibility Factor v_i

v_i = 1   if linecast(node → user chest) is unobstructed
      0   otherwise

A physics linecast from the node position to the user's chest. If it hits furniture or a wall, v_i = 0. This is binary — either there's a clear path or there isn't.

Weight: 0.5 — the highest weight, because an occluded node is nearly useless regardless of its other properties.

Step 2 — Angle Factor α_i

α_i = max(0,  1 - θ_i / (FOV_i / 2))

This maps the user's angular position within the FOV cone to a continuous score: 1.0 at dead center, 0.0 at the FOV boundary. A node where the user appears near the edge of frame gets a low angle score even if the linecast is clear.

Weight: 0.3

Step 3 — Distance Factor d_i

d_i = max(0,  1 - dist(node, user_chest) / 10)

Linear decay from 1.0 at 0m to 0.0 at 10m. I chose 10m as the normalization distance after observing that the largest room in my simulation is about 6m across — so 10m means a node in the opposite corner of the largest room still gets a non-zero distance score, but it's clearly penalized.

Weight: 0.2 — lowest weight, because distance matters less than occlusion or angle.

Final Score

s_i = (v_i × 0.5 + α_i × 0.3 + d_i × 0.2) × m_i

m_i ∈ [0.5, 1.0] is a per-node multiplier set in the Unity Inspector, allowing me to manually downweight nodes with known limitations (e.g., a node that points toward a window and produces glare in afternoon light).

Nodes with s_i ≥ 0.50 are admitted to the candidate list, sorted descending. The top-2 are captured.

Pseudocode

function ScoreCamerasRanked(user, cameras, s_min=0.50):
    aimPos ← user.position + [0, 1.2, 0]
    qualified ← []

    for node in cameras:
        θ ← angle(node.forward, aimPos - node.position)

        if θ > node.FOV / 2:
            node.score ← 0
            continue                    // hard FOV gate

        v ← 1 if Linecast(node.position, aimPos) clear else 0
        α ← clamp(1 - θ / (node.FOV/2), 0, 1)
        d ← clamp(1 - dist(node.position, aimPos) / 10, 0, 1)

        node.score ← (v*0.5 + α*0.3 + d*0.2) * node.multiplier

        if node.score ≥ s_min:
            qualified.append(node)

    return sort(qualified, key=score, descending=True)
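For readers who want to run the logic, here is the same algorithm as a self-contained Python sketch. The linecast is injected as a callable, since outside Unity there is no `Physics.Linecast`; the `Node` class and function names are my own stand-ins:

```python
import math
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    position: tuple          # (x, y, z) world coordinates
    forward: tuple           # forward direction vector
    fov: float               # full FOV angle in degrees
    multiplier: float = 1.0  # per-node m_i in [0.5, 1.0]
    score: float = 0.0

def _angle_deg(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / (n1 * n2)))))

def score_cameras_ranked(user_pos, cameras, linecast_clear, s_min=0.50):
    """Rank nodes by s_i = (0.5·v + 0.3·α + 0.2·d) · m_i with a hard FOV gate.

    linecast_clear(node_pos, aim_pos) -> bool stands in for the physics
    linecast (Physics.Linecast in the Unity version).
    """
    aim = (user_pos[0], user_pos[1] + 1.2, user_pos[2])  # chest-height aim point
    qualified = []
    for node in cameras:
        to_aim = tuple(a - p for a, p in zip(aim, node.position))
        theta = _angle_deg(node.forward, to_aim)
        if theta > node.fov / 2:          # hard FOV gate: no further arithmetic
            node.score = 0.0
            continue
        v = 1.0 if linecast_clear(node.position, aim) else 0.0
        alpha = max(0.0, 1.0 - theta / (node.fov / 2))
        dist = math.sqrt(sum(c * c for c in to_aim))
        d = max(0.0, 1.0 - dist / 10.0)
        node.score = (v * 0.5 + alpha * 0.3 + d * 0.2) * node.multiplier
        if node.score >= s_min:
            qualified.append(node)
    return sorted(qualified, key=lambda n: n.score, reverse=True)
```

A node facing the user dead-on at 2m with a clear linecast scores 0.5 + 0.3 + 0.16 = 0.96; a node pointing away is gated to 0 before any factor is computed.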

Why This Design — The Honest Answer

I want to be direct about something: this scoring formula exists largely because of hardware constraints, not because it's the theoretically optimal solution.

My simulation runs on a single workstation. I have one physical camera in the Unity scene that teleports to each selected node position, renders a frame, and moves on. I could not run twelve simultaneous cameras without multiplying rendering cost by twelve. Even in simulation, I needed a fast, lightweight way to rank nodes without actually rendering from all of them first.

The weighted formula with three geometric factors fits that constraint perfectly:

  • It's O(N) where N = number of nodes — trivially fast even for N = 100
  • It uses only spatial coordinates and angles — no image rendering required
  • It's interpretable — when a node scores poorly, I can immediately see why (was it the occlusion? the angle? the distance?)

A more sophisticated approach would render a low-resolution thumbnail from each candidate node and run a quick quality assessment model on it before selecting. This would catch cases the geometric formula misses — a node with a clear linecast but the user facing directly away from it, for instance. But that requires N renders per selection decision, which was not feasible on my hardware.

The practical tradeoff: the geometric formula is fast and correct in the common case. It fails primarily when the user's facing direction is not aligned with the node's line of sight — a limitation I document explicitly in the thesis.

The Simulation vs. Reality Gap

Everything above runs in Unity. Translating this to a physical room with real IP cameras introduces three gaps that simulation completely sidesteps.

Gap 1: You Don't Know Where the User Is

In Unity, user.position is available as a ground-truth Vector3 — the exact world coordinates of the character, updated every frame.

In a real room, you don't have this. You need to estimate the user's position from the cameras themselves (using person detection + depth estimation or triangulation), from wearables, or from floor sensors. Each of these introduces estimation error that flows directly into the scoring formula.

Bridging approach: Use the fixed-node cameras to run a lightweight person detector (e.g., YOLOv8-nano) and estimate 2D floor position via homography. This gives approximate (x, z) coordinates sufficient for the scoring formula, even without depth sensors.
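The homography step can be sketched in a few lines. This assumes a 3×3 matrix `H` mapping image pixels to floor-plane coordinates, estimated once at installation from at least four known floor points (e.g. with `cv2.findHomography`); taking the bottom-center of the detection box as the foot point is a common heuristic, not something from the thesis:

```python
import numpy as np

def bbox_to_floor_xz(bbox, H):
    """Map a person-detection bbox to approximate floor coordinates.

    bbox: (x1, y1, x2, y2) in pixels; the bottom-center of the box is taken
    as the point where the person meets the floor.
    H: 3x3 pixel-to-floor homography. Returns (x, z) on the floor plane.
    """
    x1, y1, x2, y2 = bbox
    foot = np.array([(x1 + x2) / 2.0, y2, 1.0])  # homogeneous pixel point
    world = H @ foot
    return world[0] / world[2], world[1] / world[2]  # dehomogenize
```

The resulting (x, z) feeds the scoring formula in place of Unity's ground-truth `user.position`; the vertical coordinate is fixed by the chest-height offset anyway.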

Gap 2: node.forward Requires Extrinsic Calibration

In Unity, every node's position and forward direction are set in the editor — exact, zero-error, always current. In a real room, you need to physically calibrate each camera's extrinsic parameters (position and orientation relative to a shared world coordinate frame).

Calibration drift is real: a camera that shifts 2cm from vibration or accidental contact changes its linecast origin enough to affect visibility calculations, particularly for borderline cases.

Bridging approach: ArUco marker-based calibration at installation time, with periodic re-verification. Store calibration parameters in a config file that feeds into the scoring formula at runtime. Flag nodes whose calibration is older than a threshold for re-calibration.
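The staleness check is simple bookkeeping. A minimal sketch, assuming extrinsics are stored per node in a JSON file with a `calibrated_at` unix timestamp (the file layout and the one-month threshold are illustrative choices, not from the thesis):

```python
import json
import time
from pathlib import Path

STALE_AFTER_S = 30 * 24 * 3600  # flag nodes not re-verified within ~a month

def load_extrinsics(path):
    """Split calibrated nodes into fresh and stale lists.

    Expected JSON: {node_name: {..., "calibrated_at": unix_seconds}, ...}.
    Stale nodes should be re-calibrated before their linecast origins
    are trusted in the scoring formula.
    """
    data = json.loads(Path(path).read_text())
    now = time.time()
    fresh, stale = [], []
    for name, entry in data.items():
        if now - entry["calibrated_at"] > STALE_AFTER_S:
            stale.append(name)
        else:
            fresh.append(name)
    return fresh, stale
```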

Gap 3: Linecast ≠ Real-World Occlusion

Unity's linecast is a perfect, instantaneous ray through a static collision mesh. In a real room, occlusion is dynamic (people, pets, moved furniture), partially transparent (glass tables, thin curtains), and probabilistic.

Bridging approach: Replace the binary linecast with a visibility probability estimated from the camera's own feed. If the selected node's image shows the user partially occluded in the previous frame, reduce its score for the current selection. This creates a feedback loop: actual image quality informs future node selection.
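One way to implement that feedback loop is an exponential moving average over observed visibility. A hedged sketch — the class name, the decay constant, and the "visible keypoints / total keypoints" signal are my illustrative choices:

```python
class VisibilityTracker:
    """Smooth observed visibility per node to replace the binary linecast.

    After each capture, report the fraction of the user actually visible in
    that node's frame (e.g. visible keypoints / total keypoints from a pose
    estimator). The smoothed value substitutes for v_i in the scoring
    formula, so a node that showed a half-occluded user last frame is
    penalized on the next selection.
    """
    def __init__(self, decay=0.6):
        self.decay = decay
        self.v = {}  # node name -> smoothed visibility in [0, 1]

    def report(self, node_name, observed_fraction):
        prev = self.v.get(node_name, 1.0)  # optimistic prior for unseen nodes
        self.v[node_name] = self.decay * prev + (1 - self.decay) * observed_fraction

    def visibility(self, node_name):
        return self.v.get(node_name, 1.0)
```

The optimistic prior matters: a node that has never been selected keeps v = 1.0, so the geometric score alone decides whether it gets its first chance.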

What the Scoring Looks Like in Practice

In the Unity simulation, I visualize node scores using Gizmos in the Scene View:

  • Green sphere — score ≥ 0.50, admitted to candidate list
  • Yellow sphere — score between 0.35 and 0.50, near threshold
  • Red sphere — score > 0 but below 0.35
  • Gray sphere — FOV-gated, score = 0
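The bucketing behind the sphere colors is a one-liner per threshold — a trivial Python sketch using the thresholds from the list above (the function name is mine):

```python
def gizmo_color(score, s_min=0.50, near=0.35):
    """Map a node score to the debug-sphere color used in the Scene View."""
    if score >= s_min:
        return "green"   # admitted to candidate list
    if score >= near:
        return "yellow"  # near threshold
    if score > 0:
        return "red"     # scored, but clearly below threshold
    return "gray"        # FOV-gated
```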

During experiment setup, I use this visualization to verify that at least two nodes per room reliably score green for each behavioral spot (the sofa, the desk, the kitchen counter). If a room has only one reliably green node for a given spot, I reposition nodes before running experiments.

This debugging workflow — spatial visualization of scores before running inference — turned out to be as important as the formula itself. The formula is only as good as the node placement it operates on.

Summary

| Design Decision | Reason | Real-World Equivalent |
| --- | --- | --- |
| Hard FOV gate before weighted sum | Prevents scoring nodes that can't see the user | Same gate applies; requires accurate extrinsic calibration |
| Linecast for visibility | Fast, exact in simulation | Replace with visibility probability from live feed |
| Chest-height aim point (1.2m) | Captures torso, most informative for activity recognition | Same; depth camera or pose estimator needed for accuracy |
| Top-2 node capture | Handles single-node occlusion failures | Same strategy; second node is insurance |
| Per-node multiplier m_i | Manual override for known problem nodes | Useful for flagging nodes with fixed environmental issues (glare, permanent obstruction) |

The scoring formula is a pragmatic solution built around a specific hardware constraint: one rendering camera, twelve virtual viewpoints, a need for selection to be fast and interpretable. It works well in simulation, and the geometric logic transfers cleanly to a real deployment — but the inputs to the formula (user position, node orientation, occlusion) all need real-world measurement pipelines that simulation provides for free.

Next in the series: how the captured images feed into a zero-shot VLM pipeline, and how SBERT semantic normalization maps free-form VLM descriptions to canonical behavior labels without any training data.

Full thesis: "Personalized Proactive Service in Smart Home Robots: A Training-Free Visual Perception Framework Integrating VLM-Based Scene Grounding, RAG Memory, and Manifold Learning" — NCKU, 2025.