Dino-NestedUNet: Unlocking Foundation Vision Encoders for Pathology Tumor Bulk Segmentation via Dense Decoding

arXiv cs.CV / 5/5/2026


Key Points

  • The paper introduces Dino-NestedUNet, which pairs a pre-trained DINOv3 vision foundation encoder with a “Nested Dense Decoder” designed for more accurate boundary reconstruction in pathology tumor bulk segmentation.
  • It argues that prior approaches, which freeze VFMs and attach lightweight decoders, suffer from a capacity mismatch that degrades boundary fidelity for infiltrative tumors.
  • Dino-NestedUNet replaces sparse skip connections and simple upsampling with a dense grid of intermediate pathways to support continuous feature reuse and multi-scale recalibration during decoding.
  • Experiments on three histopathology cohorts (CHTN, OSU, CAMELYON16) show consistent gains over UNet++ and standard Dino-UNet variants, with especially strong benefits under cross-domain shift.
  • The model is also tested zero-shot (trained on CHTN, then evaluated on the unseen TIGER WSIBULK and OSU CRC cohorts without fine-tuning) to assess external generalization, which the authors present as evidence for the value of dense decoding in foundation-encoder segmentation.

Abstract

Vision foundation models (VFMs), such as DINOv3, provide rich semantic representations that are promising for computational pathology. However, many current adaptations pair frozen VFMs with lightweight decoders, creating a capacity mismatch that often limits boundary fidelity for infiltrative tumor bulk segmentation. This paper presents Dino-NestedUNet, a framework that couples a pre-trained DINOv3 encoder with a Nested Dense Decoder. Instead of sparse skip connections and linear upsampling, the proposed decoder forms a dense grid of intermediate pathways to enable continuous feature reuse and multi-scale recalibration, aligning high-level semantics with low-level morphological textures during reconstruction. We evaluate Dino-NestedUNet on three histopathology cohorts (multi-center CHTN, institutional OSU, and CAMELYON16) and observe consistent improvements over UNet++ and standard Dino-UNet variants, particularly under cross-domain shift. To further assess external generalization, we perform zero-shot evaluation by training on CHTN and directly testing on unseen TIGER WSIBULK and OSU CRC cohorts without fine-tuning. These results suggest that dense decoding is a key ingredient for unlocking foundation encoders in boundary-sensitive pathology segmentation.
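
To make the "dense grid of intermediate pathways" more concrete, the sketch below enumerates the connections of a nested decoder and compares them with a plain U-Net decoder. It is an illustration only, not the authors' implementation: the abstract does not spell out the exact wiring, so the code assumes the UNet++-style topology (node (i, j) reads every earlier node at depth i plus the upsampled node from depth i + 1). The function names and the number of levels are hypothetical, and how multi-scale feature maps are obtained from the DINOv3 encoder is not covered here.

```python
from typing import Dict, List, Tuple

Node = Tuple[int, int]  # (depth i, column j); column 0 holds encoder features


def nested_decoder_inputs(num_levels: int) -> Dict[Node, List[Node]]:
    """Map each decoder node (i, j), j >= 1, to the nodes it reads from.

    ASSUMPTION: UNet++-style wiring. Node (i, j) receives all earlier nodes
    at the same depth, (i, 0) .. (i, j-1), plus the upsampled output of the
    node one level deeper, (i+1, j-1).
    """
    if num_levels < 2:
        raise ValueError("need at least two encoder levels")
    graph: Dict[Node, List[Node]] = {}
    for j in range(1, num_levels):
        for i in range(num_levels - j):
            same_depth = [(i, k) for k in range(j)]  # dense feature reuse
            from_below = (i + 1, j - 1)              # upsampled deeper node
            graph[(i, j)] = same_depth + [from_below]
    return graph


def plain_unet_inputs(num_levels: int) -> Dict[Node, List[Node]]:
    """Sparse baseline: each decoder stage gets one encoder skip connection
    plus the upsampled output of the previous (deeper) stage."""
    if num_levels < 2:
        raise ValueError("need at least two encoder levels")
    graph: Dict[Node, List[Node]] = {}
    for j in range(1, num_levels):
        i = num_levels - 1 - j  # decoder stages lie on the diagonal i + j = L - 1
        graph[(i, j)] = [(i, 0), (i + 1, j - 1)]
    return graph


def count_edges(graph: Dict[Node, List[Node]]) -> int:
    """Total number of connections feeding the decoder nodes."""
    return sum(len(inputs) for inputs in graph.values())


if __name__ == "__main__":
    levels = 4  # hypothetical number of feature scales
    dense = nested_decoder_inputs(levels)
    sparse = plain_unet_inputs(levels)
    print("nested:", len(dense), "nodes,", count_edges(dense), "connections")
    print("plain: ", len(sparse), "nodes,", count_edges(sparse), "connections")
    print("inputs of final node (0, 3):", dense[(0, levels - 1)])
```

Under these assumptions, four levels give the nested decoder six intermediate nodes and 16 connections, against three nodes and 6 connections for the plain decoder. The full-resolution output node (0, 3) reads three same-depth predecessors in addition to the upsampled deeper node, which is the "continuous feature reuse" the abstract refers to.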