Counting Without Numbers & Finding Without Words

arXiv cs.CL / March 26, 2026


Key Points

  • The paper argues that current shelter reunification systems fail because they rely largely on visual appearance, even though animals often recognize each other acoustically through identity sounds.
  • It proposes the first multimodal reunification system that combines visual matching with acoustic biometrics to better detect pairs across stress-related appearance changes.
  • The described model is species-adaptive, handling a wide acoustic range from low-frequency elephant rumbles (around 10 Hz) to higher-frequency puppy whines (up to 4 kHz).
  • The approach is framed as being grounded in decades of cognitive science about approximate quantity perception and identity communication via sound, using probabilistic matching for robustness.
  • The authors position the work as an example of biology-grounded AI that could improve outcomes for vulnerable populations that cannot communicate with human language.

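The probabilistic matching described in the key points is not specified in detail, but its core idea, fusing a visual similarity score with an acoustic one so that a stress-degraded appearance match can be rescued by a strong voice match, can be sketched roughly. The function name, weights, and logistic fusion below are illustrative assumptions, not the paper's actual method:

```python
import math

def match_probability(visual_sim, acoustic_sim, w_visual=0.5, w_acoustic=0.5):
    """Hypothetical fusion of visual and acoustic similarity scores
    (each in [0, 1]) into one match probability.

    Each score is converted to log-odds, weighted, summed, and mapped
    back through a sigmoid, so a weak visual match can still yield a
    confident overall match when the acoustic evidence is strong.
    """
    def logit(p, eps=1e-6):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        return math.log(p / (1 - p))

    z = w_visual * logit(visual_sim) + w_acoustic * logit(acoustic_sim)
    return 1 / (1 + math.exp(-z))
```

With equal weights, `match_probability(0.3, 0.95)` exceeds `match_probability(0.3, 0.3)`, which captures the claimed robustness to appearance changes: the acoustic channel compensates when the visual channel degrades.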
Abstract

Every year, 10 million pets enter shelters, separated from their families. Despite desperate searches by both guardians and lost animals, 70% never reunite, not because matches do not exist, but because current systems look only at appearance, while animals recognize each other through sound. We ask: why does computer vision treat vocalizing species as silent visual objects? Drawing on five decades of cognitive science showing that animals perceive quantity approximately and communicate identity acoustically, we present the first multimodal reunification system integrating visual and acoustic biometrics. Our species-adaptive architecture processes vocalizations from 10 Hz elephant rumbles to 4 kHz puppy whines, paired with probabilistic visual matching that tolerates stress-induced appearance changes. This work demonstrates that AI grounded in biological communication principles can serve vulnerable populations that lack human language.
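A species-adaptive front-end spanning 10 Hz rumbles to 4 kHz whines implies, at minimum, per-species analysis bands and sample rates. The band table and helper below are an illustrative sketch of that idea only; the band edges and the `nyquist_margin` parameter are assumptions, not values from the paper:

```python
# Hypothetical per-species analysis bands (Hz) for the acoustic front-end.
# Elephant rumbles sit near 10 Hz (infrasound); puppy whines reach ~4 kHz.
SPECIES_BANDS_HZ = {
    "elephant": (5, 250),
    "dog": (100, 4000),
}

def analysis_params(species, nyquist_margin=2.5):
    """Return (low_hz, high_hz, sample_rate) for a species.

    The sample rate is chosen comfortably above the Nyquist limit
    (2 x the band's upper edge) so the highest vocal frequencies
    are captured without aliasing.
    """
    low, high = SPECIES_BANDS_HZ[species]
    sample_rate = int(high * 2 * nyquist_margin)
    return low, high, sample_rate
```

The point of the sketch is the design choice it encodes: a single fixed spectrogram configuration cannot serve both ends of this range, so the front-end must select its band and sampling parameters per species before feature extraction.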