ViFiCon: Vision and Wireless Association Via Self-Supervised Contrastive Learning

arXiv cs.CV / 4/27/2026


Key Points

  • ViFiCon is a self-supervised contrastive learning method that learns cross-modal associations between vision (RGB-D) and wireless signals (WiFi Fine Time Measurements, FTM, from smartphones).
  • The approach exploits the fact that depth data from camera footage is inherently tied to an observable pedestrian, whereas wireless FTM data is tied only to a smartphone on the network; the task is to match vision bounding boxes to specific devices.
  • It builds temporal representations by stacking multi-person depth sequences into an image-like representation (see the sketch after this list) and trains with a scene-wide synchronization pretext task, requiring no hand-labeled cross-modal associations.
  • Experiments on pedestrian data show strong vision-to-wireless association performance—92.63% accuracy using a 25-frame (2.5s) sliding window—while avoiding the need for fully supervised training data.
  • The authors argue the method is practical for real-world systems where wireless annotations are scarce, reducing privacy and energy costs by not transmitting IMU data.
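
One way to picture the stacked depth-sequence representation is a small 2-D array in which each row holds one pedestrian's depth values over a temporal window. The sketch below is a minimal illustration under that reading; the function name `stack_depth_sequences` and the row/column layout are assumptions for illustration, not taken from the paper.

```python
import numpy as np

def stack_depth_sequences(depth_tracks, window=25):
    """Stack per-person depth sequences into a single image-like array.

    depth_tracks: list of 1-D arrays, one per detected pedestrian, each
                  holding that person's estimated depth over `window` frames.
    Returns an array of shape (num_people, window): one row per person.
    """
    rows = []
    for track in depth_tracks:
        track = np.asarray(track, dtype=np.float32)[:window]
        # Pad short tracks (e.g. a person who entered the scene late)
        # so every row has the same temporal length.
        if track.shape[0] < window:
            track = np.pad(track, (0, window - track.shape[0]), mode="edge")
        rows.append(track)
    return np.stack(rows, axis=0)

# Toy example: three pedestrians observed over a 2.5 s (25-frame) window.
tracks = [np.linspace(2.0, 3.0, 25),    # person walking away from the camera
          np.full(25, 4.5),             # person standing still
          np.linspace(6.0, 5.2, 20)]    # person entering late (short track)
scene_image = stack_depth_sequences(tracks)
print(scene_image.shape)  # (3, 25)
```

A wireless-side analogue would stack per-device FTM range sequences the same way, so both modalities arrive at the network as scene-wide, image-like inputs.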

Abstract

We introduce ViFiCon, a self-supervised contrastive scheme that learns a cross-modal association between vision and wireless modalities. Specifically, the system uses pedestrian data collected from RGB-D camera footage and WiFi Fine Time Measurements (FTM) from a user's smartphone device. Depth data from RGB-D (vision domain) is inherently linked with an observable pedestrian, but FTM data (wireless domain) is associated only with a smartphone on the network. We represent temporal sequences from both vision and wireless domains by stacking multi-person depth data sequences within an image representation. This simplicity allows both scene-wide processing and fewer vision and wireless features, alleviating the privacy and energy costs associated with transmitting IMU data. To facilitate self-supervised learning, we design a scene-wide synchronization pretext task for our network and then employ the learned representation for the downstream multimodal association task. We show that, compared to fully supervised state-of-the-art models, ViFiCon achieves high-performance vision-to-wireless association, reaching 92.63% accuracy in a 25-frame (2.5 s) sliding-window setting, i.e., finding which bounding box corresponds to which smartphone device, without hand-labeled association examples as training data. Extensive experimental results demonstrate ViFiCon's applicability in real-world systems where wireless data annotations are scarce.
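
To make the pipeline concrete, the sketch below shows how a contrastive objective can drive the association step: vision and wireless window embeddings are trained with an InfoNCE-style loss over time-synchronized pairs, and at inference each bounding box is assigned the device with the highest cosine similarity. The encoders, embedding dimension, temperature, and function names here are illustrative assumptions, not the paper's actual architecture or pretext-task formulation.

```python
import torch
import torch.nn.functional as F

def info_nce(vision_emb, wireless_emb, temperature=0.1):
    """Symmetric InfoNCE-style contrastive loss over aligned pairs.

    vision_emb, wireless_emb: (N, D) embeddings for N time-synchronized
    vision (depth-sequence) and wireless (FTM-sequence) windows; row i of
    each tensor is assumed to come from the same person/device pair.
    """
    v = F.normalize(vision_emb, dim=-1)
    w = F.normalize(wireless_emb, dim=-1)
    logits = v @ w.t() / temperature           # (N, N) similarity matrix
    targets = torch.arange(v.size(0))          # matching pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def associate(vision_emb, wireless_emb):
    """Downstream association: assign each bounding-box embedding the
    wireless device with the highest cosine similarity."""
    v = F.normalize(vision_emb, dim=-1)
    w = F.normalize(wireless_emb, dim=-1)
    return (v @ w.t()).argmax(dim=1)           # best device index per box

# Toy usage with random embeddings for 4 people in one 25-frame window.
v_emb, w_emb = torch.randn(4, 128), torch.randn(4, 128)
loss = info_nce(v_emb, w_emb)
print(loss.item(), associate(v_emb, w_emb))
```

In practice the embeddings would come from learned encoders over the stacked depth and FTM window representations, and the association would be evaluated per sliding window, as in the reported 25-frame (2.5 s) setting.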