Uncertainty Estimation in Instance Segmentation of Affordances via Bayesian Visual Transformers

arXiv cs.CV / 5/6/2026


Key Points

  • The paper presents an approach to instance segmentation of “affordances” in images, aiming for accurate and localized prediction of interaction-capable regions rather than coarse saliency maps.
  • It uses Bayesian Visual Transformers with sample-based and ensemble uncertainty estimation, extracting pixel-wise epistemic and aleatoric variances at both semantic and spatial levels.
  • The authors introduce “Probability-based Mask Quality” to analyze semantic and spatial variations in probabilistic instance segmentation, improving interpretability of uncertainty outputs.
  • Experiments on the IIT-Aff dataset show that a global consensus from multiple Bayesian sub-networks outperforms deterministic networks, yielding a +7.4 percentage-point gain on the weighted Fβ score while also improving calibration and reducing overconfident probabilities.
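The pixel-wise split into epistemic and aleatoric variance mentioned above is commonly obtained by aggregating several stochastic forward passes (e.g. from Bayesian sub-networks or an ensemble). The sketch below illustrates one standard decomposition for binary per-pixel probabilities; it is an assumption about the general technique, not the paper's exact formulation, and the function name `decompose_uncertainty` is hypothetical.

```python
import numpy as np

def decompose_uncertainty(probs):
    """Split predictive uncertainty from T stochastic passes.

    probs: array of shape (T, H, W) with per-pixel foreground
    probabilities, one slice per sub-network / MC sample.
    """
    mean_p = probs.mean(axis=0)
    # Epistemic: disagreement between sub-networks (variance of the means).
    epistemic = probs.var(axis=0)
    # Aleatoric: expected Bernoulli variance of each individual prediction.
    aleatoric = (probs * (1.0 - probs)).mean(axis=0)
    return mean_p, epistemic, aleatoric
```

When all sub-networks agree, the epistemic term vanishes and only the data-driven aleatoric term remains, which matches the qualitative observation that aleatoric variance concentrates on object contours.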

Abstract

Visual affordances identify regions in an image with potential interactions, offering a novel paradigm for scene understanding. Recognizing affordances allows autonomous robots to act more naturally, could enhance human-robot interaction, enrich augmented reality systems, and benefit prosthetic vision devices. Accurate and localized prediction of affordance regions, rather than general saliency maps, is crucial for these applications. We present a model for instance segmentation of affordances by adopting sample-based and ensemble approaches for uncertainty estimation. We extend an attention-based architecture for our novel task, showing with detailed ablation experiments the effects of each component. By comparing the distribution of these different detections, we extract pixel-wise epistemic and aleatoric variances at both the semantic and spatial levels. In addition, we propose a novel measure called Probability-based Mask Quality, which enables a comprehensive analysis of semantic and spatial variations in a probabilistic instance segmentation model. Our results show that the global consensus of multiple sub-networks of Bayesian models improves on deterministic networks thanks to better mask refinement and generalization. Combined with the more powerful features extracted by attention-based mechanisms, this yields an improvement of +7.4 percentage points on the F_{\beta}^w score on the challenging IIT-Aff dataset. Bayesian models are also better calibrated, producing less overconfident probabilities and better uncertainty estimates. Qualitative results show that aleatoric variance appears on the contours of objects, while epistemic variance is observed at visually challenging pixels, adding interpretability to the neural network.
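The calibration claim in the abstract ("less overconfident probabilities") is typically quantified with the expected calibration error (ECE), which compares per-bin confidence against per-bin accuracy. The following is a minimal sketch of that standard metric, not the evaluation code used in the paper; the flattened `confidences`/`correct` inputs are an assumption for illustration.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE over equal-width confidence bins.

    confidences: 1-D array of predicted probabilities in [0, 1]
                 (e.g. flattened per-pixel foreground probabilities).
    correct:     1-D array of 0/1 indicators of whether each
                 prediction was right.
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += (mask.sum() / n) * gap  # bin weight times accuracy-confidence gap
    return ece
```

A perfectly calibrated model has zero ECE; an overconfident one shows bins where mean confidence exceeds accuracy, which is the failure mode the Bayesian ensemble is reported to reduce.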