Lightweight Distillation of SAM 3 and DINOv3 for Edge-Deployable Individual-Level Livestock Monitoring and Longitudinal Visual Analytics

arXiv cs.CV / 5/1/2026


Key Points

  • The study proposes compressing large foundation-model components for precision livestock farming so they can run on commodity edge accelerators with limited GPU memory.
  • It distills SAM 3’s 446M-parameter Perception Encoder into a 40.66M-parameter multi-scale student using a TinyViT-based feature-pyramid encoder, a four-term direction-then-scale distillation loss, and backbone-substitution inference with sliding-window session pruning to cap streaming memory growth.
  • It also adopts a pre-distilled DINOv3-family ViT-S/16 variant (about 21.6M parameters), released alongside a much larger ViT-7B teacher, as the per-animal embedder.
  • Experiments on the Edinburgh Pig dataset show close agreement with the SAM 3 teacher (92.29% MOTA, 96.15% IDF1) at a fraction of the system size and peak VRAM, and the approach also reaches 97.34% top-1 accuracy on nine-class pig behaviour classification.
  • The resulting pipeline fits within an NVIDIA Jetson Orin NX 16GB envelope, and the paper outlines an (unvalidated) on-device embedding-pool re-identification mechanism for building longitudinal visual records amenable to downstream outcome association (e.g., disease and lameness).
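
The summary names a "four-term direction-then-scale" distillation loss but does not spell out its terms. As a hedged sketch only: assuming one cosine (direction) term and one feature-norm (scale) term at each of two feature-pyramid levels (giving four terms total), such a loss might look like the following; the function name, weights, and term decomposition are illustrative assumptions, not the paper's definition.

```python
import torch
import torch.nn.functional as F

def direction_then_scale_loss(student_feats, teacher_feats, w_dir=1.0, w_scale=1.0):
    """Hypothetical sketch of a direction-then-scale feature distillation loss.

    Assumes per pyramid level: one cosine (direction) term that aligns
    unit-normalised features, and one L1 (scale) term that matches per-token
    feature magnitudes. With two pyramid levels this yields four terms.
    """
    loss = torch.zeros(())
    for s, t in zip(student_feats, teacher_feats):
        # Flatten (B, C, H, W) feature maps into (B, N, C) token sequences.
        s = s.flatten(2).transpose(1, 2)
        t = t.flatten(2).transpose(1, 2)
        # Direction term: 1 - cosine similarity between student/teacher tokens.
        loss = loss + w_dir * (1.0 - F.cosine_similarity(s, t, dim=-1)).mean()
        # Scale term: match per-token feature norms after directions agree.
        loss = loss + w_scale * F.l1_loss(s.norm(dim=-1), t.norm(dim=-1))
    return loss
```

By construction the loss is zero when student and teacher features coincide, and the direction term is insensitive to feature magnitude, which is what motivates splitting alignment into direction and scale in the first place.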

Abstract

Foundation-model pipelines for individual-level livestock monitoring -- combining open-vocabulary detection, promptable video segmentation, and self-supervised visual embeddings -- have raised the accuracy ceiling of precision livestock farming (PLF), but their GPU memory budgets exceed the envelope of commodity edge accelerators. To close this gap, the 446M-parameter Perception Encoder (PE-ViT-L+) backbone of SAM 3 is distilled into a 40.66M-parameter multi-scale student through three mechanisms: a Feature Pyramid Network student encoder built on TinyViT-21M-512, a four-term direction-then-scale distillation loss, and backbone-substitution inference with sliding-window session pruning that bounds streaming GPU memory growth. The DINOv3 family includes a pre-distilled ViT-S/16 variant (21.6M parameters) released alongside a 6716M-parameter ViT-7B teacher; this ViT-S/16 variant is adopted as the per-individual embedder. On the Edinburgh Pig dataset, the compressed pipeline reaches 92.29% MOTA and 96.15% IDF1 against the SAM 3 teacher (1.68- and 0.84-percentage-point losses), achieves a 7.77-fold reduction in system-level parameters and a 3.01-fold reduction in peak VRAM (19.52GB -> 6.49GB), and reaches 97.34% top-1 accuracy with 91.67% macro-F1 on nine-class pig behaviour classification. The pipeline fits inside an NVIDIA Jetson Orin NX 16GB envelope with 4.9GB of headroom, supporting a proposed -- but not yet empirically validated -- on-device embedding-pool re-identification mechanism whose per-individual footprint of approximately 94MB per animal per year produces a longitudinal visual record amenable to retrospective association with disease, lameness, reproductive, and growth outcome labels.
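
The ~94MB-per-animal-per-year figure can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes 384-dimensional DINOv3 ViT-S/16 embeddings stored uncompressed as float32; the paper's actual logging cadence and storage format are not given in this summary, so these assumptions are illustrative only.

```python
# Hypothetical storage arithmetic -- the 384-dim / float32 / uncompressed
# assumptions are NOT stated in the abstract above.
EMB_DIM = 384                 # DINOv3 ViT-S/16 embedding dimension
BYTES_PER_VALUE = 4           # float32
bytes_per_embedding = EMB_DIM * BYTES_PER_VALUE   # 1536 bytes per embedding

yearly_budget = 94e6          # ~94 MB per animal per year (from the abstract)
embeddings_per_year = yearly_budget / bytes_per_embedding
embeddings_per_day = embeddings_per_year / 365
minutes_between = 24 * 60 / embeddings_per_day

print(f"{embeddings_per_year:,.0f} embeddings/year")   # ~61,198
print(f"{embeddings_per_day:.0f} embeddings/day")      # ~168
print(f"one embedding every ~{minutes_between:.1f} min")  # ~8.6 min
```

Under these assumptions, the budget corresponds to roughly one embedding every nine minutes per animal, i.e. a sparse longitudinal log rather than per-frame storage, which is consistent with the goal of retrospective outcome association rather than continuous tracking.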