Frozen Vision Transformers for Dense Prediction on Small Datasets: A Case Study in Arrow Localization

arXiv cs.CV / April 21, 2026


Key Points

  • The paper introduces an automated arrow puncture detection, localization, and scoring system for indoor archery targets using only 48 annotated photos.
  • Its pipeline uses a color-based rectification step to normalize perspective and a frozen self-supervised vision transformer (DINOv3 ViT-L/16) with AnyUp-guided feature upsampling to recover sub-millimeter precision from coarse patch tokens.
  • Lightweight CenterNet-style heatmap detection heads are trained with only 3.8M trainable parameters out of 308M total, enabling strong sample efficiency.
  • Experiments across three cross-validation folds report a mean F1 of 0.893 (±0.011) and a mean localization error of 1.41 (±0.06) mm, with a median downstream scoring error of about 1.8%.
  • An ablation study finds that an offset regression head (commonly used for sub-pixel refinement) adds little detection benefit and can hurt localization, suggesting the guided upsampling already restores spatial precision lost in tokenization.
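The CenterNet-style decoding mentioned above reduces, at inference time, to finding local maxima in a predicted center heatmap. A minimal sketch of that decoding step (the 3×3 max-pool non-maximum suppression standard for CenterNet heads; the `threshold` value is illustrative, not from the paper):

```python
import numpy as np

def decode_heatmap(heat, threshold=0.5):
    """Extract peak coordinates from a CenterNet-style center heatmap.

    A cell is kept when it is a strict maximum within its 3x3
    neighbourhood and its score exceeds `threshold`. Returns an
    (N, 2) array of (row, col) peak positions.
    """
    h, w = heat.shape
    padded = np.pad(heat, 1, mode="constant", constant_values=-np.inf)
    # Maximum over each cell's 8 neighbours (a stride-1 3x3 max-pool,
    # excluding the center cell so peaks compare strictly).
    neigh = np.max(
        [padded[dr:dr + h, dc:dc + w]
         for dr in range(3) for dc in range(3) if (dr, dc) != (1, 1)],
        axis=0,
    )
    keep = (heat > neigh) & (heat >= threshold)
    return np.argwhere(keep)

# Toy heatmap with two arrow-center peaks.
heat = np.zeros((8, 8))
heat[2, 3] = 0.9
heat[6, 6] = 0.7
peaks = decode_heatmap(heat)
```

Because the heatmap lives in the rectified coordinate system, each recovered `(row, col)` peak maps directly to a physical position on the target face.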

Abstract

We present a system for automated detection, localization, and scoring of arrow punctures on 40 cm indoor archery target faces, trained on only 48 annotated photographs (5,084 punctures). Our pipeline combines three components: a color-based canonical rectification stage that maps perspective-distorted photographs into a standardized coordinate system where pixel distances correspond to known physical measurements; a frozen self-supervised vision transformer (DINOv3 ViT-L/16) paired with AnyUp guided feature upsampling to recover sub-millimeter spatial precision from 32 × 32 patch tokens; and lightweight CenterNet-style detection heads for arrow-center heatmap prediction. Only 3.8 M of 308 M total parameters are trainable. Across three cross-validation folds, we achieve a mean F1 score of 0.893 ± 0.011 and a mean localization error of 1.41 ± 0.06 mm, comparable to or better than prior fully-supervised approaches that require substantially more training data. An ablation study shows that the CenterNet offset regression head, typically essential for sub-pixel refinement, provides negligible detection improvement while degrading localization in our setting. This suggests that guided feature upsampling already resolves the spatial precision lost through patch tokenization. On downstream archery metrics, the system recovers per-image average arrow scores with a median error of 1.8% and group centroid positions to within a median of 4.00 mm. These results demonstrate that frozen foundation models with minimal task-specific adaptation offer a practical paradigm for dense prediction in small-data regimes.
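The downstream scoring step the abstract evaluates can be illustrated with the geometry of a standard 40 cm face: because rectification makes pixel distances correspond to physical millimeters, a localized puncture converts to a ring score by its radial distance from the center. The sketch below assumes standard World Archery ring geometry (ten concentric rings, each 20 mm wide); the paper's exact scoring convention may differ:

```python
import math

def ring_score(x_mm, y_mm):
    """Score a puncture at (x_mm, y_mm) relative to the target center.

    Assumes a 40 cm indoor face with ten 20 mm-wide scoring rings:
    the 10-ring has a 20 mm radius, and anything beyond a 200 mm
    radius is a miss (score 0). Illustrative only; boundary
    ("line-cutter") rules are not modeled.
    """
    r = math.hypot(x_mm, y_mm)
    if r >= 200.0:
        return 0
    return 10 - int(r // 20.0)

# A puncture 15 mm from center lands in the 10-ring; 55 mm lands in the 8-ring.
```

Under this geometry, the paper's 1.41 mm mean localization error is well under the 20 mm ring width, which is consistent with the small (1.8% median) scoring error reported.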