Storing Less, Finding More: How Novelty Filtering Improves Cross-Modal Retrieval on Edge Cameras

arXiv cs.CV / 4/1/2026


Key Points

  • The paper argues that always-on edge cameras suffer cross-modal retrieval degradation because redundant frames crowd out correct matches in top-k results.
  • It proposes a streaming retrieval architecture that uses an on-device epsilon-net novelty filter to keep only semantically novel frames, forming a denoised embedding index.
  • To address alignment limitations from using a compact on-device encoder, the system adds a cross-modal adapter plus a cloud re-ranker.
  • The single-pass streaming filter outperforms several offline frame-selection baselines (k-means, farthest-point sampling, uniform, random) across eight vision-language models on two egocentric datasets (AEA and EPIC-KITCHENS).
  • The full system reports strong retrieval quality (45.6% Hit@5 on held-out data) while running an 8M-parameter on-device encoder at an estimated 2.7 mW.
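
The epsilon-net idea in the second key point can be sketched as a greedy single-pass rule: keep a frame only if its embedding is at least epsilon away from every embedding already kept. A minimal illustrative sketch, assuming cosine distance on unit-normalized embeddings; the function name and distance choice are assumptions, not the paper's exact implementation:

```python
import numpy as np

def epsilon_net_filter(embeddings, eps):
    """Greedy single-pass epsilon-net novelty filter (illustrative sketch).

    Retains a frame only if its cosine distance to every previously
    retained embedding exceeds eps, so redundant near-duplicate frames
    never enter the index.
    """
    kept = []     # indices of retained frames
    centers = []  # unit-normalized embeddings of retained frames
    for i, e in enumerate(embeddings):
        e = np.asarray(e, dtype=float)
        e = e / np.linalg.norm(e)
        # Novel only if far (cosine distance > eps) from all kept centers.
        if all(1.0 - float(c @ e) > eps for c in centers):
            centers.append(e)
            kept.append(i)
    return kept
```

A larger epsilon retains fewer frames. The point of the architecture is that this decision is made in one streaming pass on-device, whereas baselines like k-means or farthest-point sampling need the whole sequence offline.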

Abstract

Always-on edge cameras generate continuous video streams where redundant frames degrade cross-modal retrieval by crowding correct results out of top-k search. This paper presents a streaming retrieval architecture: an on-device epsilon-net filter retains only semantically novel frames, building a denoised embedding index; a cross-modal adapter and cloud re-ranker compensate for the compact encoder's weak alignment. A single-pass streaming filter outperforms offline alternatives (k-means, farthest-point sampling, uniform, random) across eight vision-language models (8M–632M parameters) on two egocentric datasets (AEA, EPIC-KITCHENS). Combined, the architecture reaches 45.6% Hit@5 on held-out data using an 8M on-device encoder at an estimated 2.7 mW.
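
Hit@5, the metric quoted above, is the fraction of text queries whose ground-truth frame appears among the five highest-similarity frames in the index. A minimal sketch, assuming a query-by-frame similarity matrix; the names and layout are assumptions, not the paper's evaluation code:

```python
import numpy as np

def hit_at_k(sim, gt, k=5):
    """Hit@k for cross-modal retrieval (illustrative sketch).

    sim: (num_queries, num_frames) similarity matrix
    gt:  ground-truth frame index for each query
    Returns the fraction of queries whose ground-truth frame
    lands in the top-k retrieved frames.
    """
    # Negate so argsort ranks frames from most to least similar.
    topk = np.argsort(-sim, axis=1)[:, :k]
    hits = [g in row for g, row in zip(gt, topk)]
    return float(np.mean(hits))
```

This is why redundancy hurts: near-duplicate frames with high similarity can fill all k slots, pushing the single correct frame out of the top-k even when its similarity is high in absolute terms.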