Self-calibrating cross-camera homography for real-time ghost prediction in multi-camera person tracking[P]

Reddit r/MachineLearning / 5/1/2026

💬 Opinion · Tools & Practical Usage · Models & Research

Key Points

  • The work addresses multi-camera person tracking failures when one camera loses a target by replacing naive pixel extrapolation with a self-calibrated cross-camera homography that accounts for differing coordinate systems.
  • It learns a 3x3 homography matrix by collecting foot-point correspondences from moments when both cameras simultaneously observe the same person, using HSV appearance matching with EMA smoothing and then estimating H via cv2.findHomography() + RANSAC.
  • The system continuously re-learns the homography every 5 new correspondence pairs and monitors reprojection error to automatically flush H if it degrades, enabling real-time “ghost prediction” during temporary occlusions.
  • It provides three prediction fallbacks—homography-based projection, adaptive pixel extrapolation, and a world-coordinate pinhole projection from a fused 3D Kalman state—while also improving robustness through trust-weighted sensor updates and DeepSORT-based tracking with Hungarian assignment fallbacks.
  • The implementation (with unit tests and CI) reports very low computational cost for homography updates and per-prediction projection, and notes limitations such as breakdown for steep/elevated camera angles and weak HSV-based Re-ID for similar-looking people at close distances.

The problem: In multi-camera tracking, when camera A loses track of a person but camera B still sees them, naive approaches extrapolate pixel coordinates linearly. This fails immediately because cameras have completely different coordinate systems. A person at pixel (400, 300) on camera B might be at (800, 500) on camera A, depending on relative position and angle.

Approach: When both cameras simultaneously observe the same person (matched via 64-dim HSV appearance descriptors, L2-normalized, EMA-smoothed at alpha=0.3), we record foot-point correspondence pairs. Bottom-center of the bounding box in each view projects to the same physical ground-plane point.

After 4+ such pairs, cv2.findHomography() + RANSAC gives a 3x3 matrix H mapping camera B pixel space to camera A. System auto-relearns every 5 new pairs and monitors reprojection error, flushing H if it spikes (camera moved).

Three fallback paths:

  • Path A (H-PROJ, green): homography projection from any source camera with valid H. Most accurate.
  • Path B (EXTRAP, red): pixel extrapolation with adaptive budget min(250px, 80 + 40*t). Last resort.
  • Path C (WORLD, orange): world-coordinate pinhole projection from fused 3D Kalman state. Always available.
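The two mechanical pieces behind Paths A and B can be sketched as below (my sketch; the function names are mine, the unit of t is not stated in the post, and the exact fallback ordering beyond "last resort" for extrapolation is not specified):

```python
import numpy as np

def extrapolation_budget(t_lost, base=80.0, rate=40.0, cap=250.0):
    """Path B's adaptive pixel budget: min(250px, 80 + 40*t), where t is
    the time since the track was lost (units unspecified in the post)."""
    return min(cap, base + rate * t_lost)

def project_via_H(H, foot_b):
    """Path A: map a camera-B foot point into camera-A pixel space through
    the learned homography (homogeneous multiply, then dehomogenize)."""
    v = H @ np.array([foot_b[0], foot_b[1], 1.0])
    return float(v[0] / v[2]), float(v[1] / v[2])
```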

Costs:

  • Homography re-estimation: < 0.1ms (called every 5 new pairs)
  • Per-prediction projection: < 0.001ms

Tracking: Hungarian assignment with a combined cost weighted 0.6 * IoU + 0.4 * cosine appearance. DeepSORT (MobileNet embeddings) is the primary tracker; it falls back to plain Hungarian assignment (scipy), then to centroid matching.
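A minimal version of that assignment step, assuming unit-norm appearance features and (x1, y1, x2, y2) boxes (the helper names and IoU implementation are mine, not the repo's):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def assign(tracks, dets, track_feats, det_feats, w_iou=0.6, w_app=0.4):
    """Hungarian assignment on cost = 0.6*(1 - IoU) + 0.4*(1 - cosine)."""
    cost = np.zeros((len(tracks), len(dets)))
    for i, (tb, tf) in enumerate(zip(tracks, track_feats)):
        for j, (db, df) in enumerate(zip(dets, det_feats)):
            cos = float(np.dot(tf, df))  # features assumed unit-norm
            cost[i, j] = w_iou * (1.0 - iou(tb, db)) + w_app * (1.0 - cos)
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows.tolist(), cols.tolist()))
```

Note the cost uses (1 - IoU) and (1 - cosine) so that linear_sum_assignment, which minimizes, prefers high-overlap, high-similarity pairs.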

Sensor trust: Each camera earns a trust score in [0.1, 1.0] via measurement consistency. High-innovation measurements get down-weighted. Kalman measurement noise R is rescaled per update based on detection confidence, bbox area, and sensor trust.
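One way those mechanics could look, as a sketch only: the update increments, the innovation gate, and the multiplicative scaling form are all my assumptions — the post states only the inputs and the [0.1, 1.0] clamp.

```python
def update_trust(trust, innovation_norm, gate=3.0, up=0.02, down=0.1):
    """Consistency-based trust in [0.1, 1.0]: small-innovation (consistent)
    measurements slowly raise trust, large-innovation outliers drop it
    faster (gate and step sizes are assumed values)."""
    trust += up if innovation_norm <= gate else -down
    return min(1.0, max(0.1, trust))

def scaled_measurement_noise(r_base, confidence, bbox_area, trust,
                             area_ref=10000.0):
    """Per-update Kalman measurement noise R: lower confidence, smaller
    (farther) boxes, and lower sensor trust all inflate R, so the filter
    leans on its prediction for dubious measurements (scaling form assumed)."""
    conf_term = 1.0 / max(confidence, 1e-3)
    area_term = max(area_ref / max(bbox_area, 1.0), 1.0)
    trust_term = 1.0 / max(trust, 0.1)
    return r_base * conf_term * area_term * trust_term
```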

Full implementation: github.com/mandarwagh9/overwatch. 57 unit tests covering Kalman, homography, tracking. CI on GitHub Actions.

Limitations: ground-plane homography breaks for elevated cameras with steep angles. Re-ID via HSV histograms is weak for people in similar clothing at close spatial proximity.

Curious if anyone has tackled non-ground-plane cross-camera projection or used learned embeddings instead of HSV histograms for re-ID at this inference budget.

submitted by /u/Straight_Stable_6095