Instance-level Visual Active Tracking with Occlusion-Aware Planning

arXiv cs.CV / 4/24/2026

Key Points

The paper introduces OA-VAT, a visual active tracking pipeline that controls a camera to follow a target in 3D space. It addresses two bottlenecks in real-world deployment: confusion caused by visually similar distractors and tracking failure under occlusion.
OA-VAT’s training-free Instance-Aware Offline Prototype Initialization uses DINOv3-based multi-view augmented features to build discriminative instance prototypes that reduce errors from visually similar distractors.
The Online Prototype Enhancement Tracker then refines these prototypes during tracking and applies a confidence-aware Kalman filter to keep tracking stable under appearance and motion changes.
For occlusion recovery, OA-VAT adds an Occlusion-Aware Trajectory Planner, trained on the new Planning-20k dataset, which uses conditional diffusion to generate obstacle-avoiding paths.
Reported results are 0.93 average SR on UnrealCV (+2.2% vs. TrackVLA), 90.8% average CAR on real-world datasets (+12.1% vs. GC-VAT), and 81.6% TSR on a DJI Tello drone. The system runs at 35 FPS on an RTX 3090, which the authors present as evidence of robust real-time performance for practical deployment.
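The prototype idea in the key points above can be illustrated with a small sketch. It is not the authors' implementation: the summary does not say how multi-view features are aggregated or compared, so this assumes a plain mean of L2-normalized feature vectors and cosine-similarity matching. The function names (`build_prototype`, `pick_target`) are hypothetical, and the short lists of floats stand in for real DINOv3 embeddings.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [v / norm for v in vec]


def build_prototype(view_features):
    """Assumed aggregation: mean of unit-length per-view features, re-normalized."""
    normalized = [l2_normalize(f) for f in view_features]
    dim = len(normalized[0])
    mean = [sum(f[i] for f in normalized) / len(normalized) for i in range(dim)]
    return l2_normalize(mean)


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def pick_target(prototype, candidates):
    """Return the index of the candidate most similar to the prototype, plus all scores."""
    scores = [cosine_similarity(prototype, c) for c in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy stand-ins for features of the same instance seen from three augmented views.
views = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0], [1.0, 0.0, 0.1]]
prototype = build_prototype(views)

# Candidate 0 plays the distractor, candidate 1 the true target.
candidates = [[0.0, 1.0, 0.0], [1.0, 0.1, 0.05]]
best_index, scores = pick_target(prototype, candidates)
```

Averaging over several views is one simple way to make a prototype less sensitive to any single viewpoint; the paper may use a different aggregation.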

Abstract

Visual Active Tracking (VAT) aims to control cameras to follow a target in 3D space, which is critical for applications like drone navigation and security surveillance. However, it faces two key bottlenecks in real-world deployment: confusion from visually similar distractors caused by insufficient instance-level discrimination and severe failure under occlusions due to the absence of active planning. To address these, we propose OA-VAT, a unified pipeline with three complementary modules. First, a training-free Instance-Aware Offline Prototype Initialization aggregates multi-view augmented features via DINOv3 to construct discriminative instance prototypes, mitigating distractor confusion. Second, an Online Prototype Enhancement Tracker enhances prototypes online and integrates a confidence-aware Kalman filter for stable tracking under appearance and motion changes. Third, an Occlusion-Aware Trajectory Planner, trained on our new Planning-20k dataset, uses conditional diffusion to generate obstacle-avoiding paths for occlusion recovery. Experiments demonstrate OA-VAT achieves 0.93 average SR on UnrealCV (+2.2% vs. SOTA TrackVLA), 90.8% average CAR on real-world datasets (+12.1% vs. SOTA GC-VAT), and 81.6% TSR on a DJI Tello drone. Running at 35 FPS on an RTX 3090, it delivers robust, real-time performance for practical deployment.
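The abstract does not specify how detection confidence enters the Kalman filter. One common approach, shown here as a minimal sketch under that assumption, is to inflate the measurement noise when confidence is low so that unreliable detections move the state estimate less. The class name `ConfidenceAwareKalman1D`, the scalar random-walk state model, and all parameter values are hypothetical and chosen only for illustration.

```python
class ConfidenceAwareKalman1D:
    """Scalar Kalman filter whose measurement noise grows as confidence drops.

    Assumption: measurement variance = base_meas_var / confidence.
    This is an illustrative choice, not the formulation from the paper.
    """

    def __init__(self, x0, p0=1.0, process_var=0.01, base_meas_var=0.1, min_conf=1e-3):
        self.x = x0                # state estimate (e.g. one image coordinate)
        self.p = p0                # estimate variance
        self.q = process_var       # process noise variance
        self.r0 = base_meas_var    # measurement variance at confidence 1.0
        self.min_conf = min_conf   # floor to avoid division by zero

    def predict(self):
        """Random-walk prediction: the state is kept, uncertainty grows."""
        self.p += self.q

    def update(self, z, confidence):
        """Fuse measurement z, weighting it by a confidence clamped to [min_conf, 1]."""
        conf = min(max(confidence, self.min_conf), 1.0)
        r = self.r0 / conf                # low confidence -> large measurement noise
        k = self.p / (self.p + r)         # Kalman gain
        self.x += k * (z - self.x)
        self.p *= (1.0 - k)
        return k


# Same measurement, different confidence: the confident filter follows it more closely.
kf_high = ConfidenceAwareKalman1D(x0=0.0)
kf_low = ConfidenceAwareKalman1D(x0=0.0)
for kf in (kf_high, kf_low):
    kf.predict()
gain_high = kf_high.update(z=10.0, confidence=1.0)
gain_low = kf_low.update(z=10.0, confidence=0.01)
```

With this weighting, a low-confidence detection, such as one taken during partial occlusion, nudges the estimate only slightly, while a confident detection pulls it most of the way to the measurement.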


