DiveUp: Learning Feature Upsampling from Diverse Vision Foundation Models
arXiv cs.CV / 3/17/2026
Key Points
- DiveUp proposes a multi-VFM relational guidance framework that uses diverse vision foundation models (VFMs) as experts to regularize feature upsampling, so that inaccurate spatial structures from any single model are not propagated.
- It introduces a universal relational feature representation, the local center-of-mass field, to reconcile unaligned feature spaces across different VFMs and enable cross-model interaction.
- The framework includes a spikiness-aware selection strategy that evaluates spatial reliability and filters out high-norm artifacts, aggregating guidance only from the most reliable expert at each local region.
- DiveUp is encoder-agnostic and jointly trainable, enabling universal upsampling of features from diverse VFMs without per-model retraining.
- Experiments show state-of-the-art performance on multiple dense prediction tasks, supporting the effectiveness of multi-expert relational guidance. Code and models are released on GitHub.
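
To make the second and third key points more concrete, here is a minimal NumPy sketch of one way such a pipeline could look. It is not the authors' implementation: the summary does not give the exact definitions, so the softmax-weighted neighbor offsets used for the "local center-of-mass field", the norm-over-median "spikiness" score, and all function names and parameters (`local_center_of_mass_field`, `spikiness`, `select_guidance`, `radius`, `temperature`) are assumptions made for illustration. The idea it shows is that a 2D offset field lives in image coordinates, so experts with different channel widths become directly comparable.

```python
import numpy as np


def local_center_of_mass_field(feats, radius=1, temperature=0.1):
    """Hypothetical relational representation (assumption, not the paper's definition).

    feats: (H, W, C) feature map from any encoder.
    Returns an (H, W, 2) field holding, per location, the similarity-weighted
    mean (dy, dx) offset toward its neighbors. The output does not depend on C.
    """
    feats = np.asarray(feats, dtype=np.float64)
    h, w, _ = feats.shape
    # Cosine similarity needs unit-length feature vectors.
    unit = feats / (np.linalg.norm(feats, axis=-1, keepdims=True) + 1e-8)
    padded = np.pad(unit, ((radius, radius), (radius, radius), (0, 0)), mode="edge")
    offsets = [(dy, dx)
               for dy in range(-radius, radius + 1)
               for dx in range(-radius, radius + 1)]
    # Similarity of every location to each neighbor in the window: (H, W, K).
    sims = np.stack(
        [np.sum(unit * padded[radius + dy:radius + dy + h,
                              radius + dx:radius + dx + w], axis=-1)
         for dy, dx in offsets],
        axis=-1,
    )
    # Numerically stable softmax over the K neighbors.
    logits = sims / temperature
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Weighted mean offset, i.e. the local "center of mass": (H, W, 2).
    return weights @ np.array(offsets, dtype=np.float64)


def spikiness(feats):
    """Crude reliability proxy (assumption): feature norm relative to the map's median norm."""
    norms = np.linalg.norm(np.asarray(feats, dtype=np.float64), axis=-1)
    return norms / (np.median(norms) + 1e-8)


def select_guidance(expert_feats):
    """Per location, keep the field of the expert with the lowest spikiness score.

    expert_feats: list of (H, W, C_e) maps sharing H and W; C_e may differ.
    Returns (guidance, best): an (H, W, 2) field and an (H, W) expert index map.
    """
    fields = np.stack([local_center_of_mass_field(f) for f in expert_feats])  # (E, H, W, 2)
    scores = np.stack([spikiness(f) for f in expert_feats])                   # (E, H, W)
    best = np.argmin(scores, axis=0)                                          # (H, W)
    guidance = np.take_along_axis(fields, best[None, :, :, None], axis=0)[0]
    return guidance, best
```

Calling `select_guidance([feats_a, feats_b])` on two maps with the same spatial size but different channel counts returns a single guidance field plus a map of which expert was chosen where. In a training setup, such a field could serve as a regularization target for the upsampler's output, though how DiveUp actually forms its loss is not stated in this summary.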




