Context Sensitivity Improves Human-Machine Visual Alignment

arXiv cs.CV / 4/16/2026


Key Points

  • The paper argues that current embedding-based ML similarity measures are often context-insensitive compared with how humans perceive objects and relationships.
  • It proposes a context-sensitive similarity computation method that uses neural network embeddings with the anchor image provided as simultaneous context.
  • Using a triplet odd-one-out task, the approach yields up to a 15% improvement in accuracy versus a context-insensitive baseline.
  • The gains are reported as consistent across both standard vision foundation models and models that are “human-aligned,” suggesting the benefit is broadly applicable.

Abstract

Modern machine learning models typically represent inputs as fixed points in a high-dimensional embedding space. While this approach has proven powerful for a wide range of downstream tasks, it fundamentally differs from the way humans process information. Because humans constantly adapt to their environment, they represent objects and their relationships in a highly context-sensitive manner. To address this gap, we propose a method for computing context-sensitive similarity from neural network embeddings, applied to modeling a triplet odd-one-out task in which the anchor image serves as simultaneous context. Modeling context enables us to achieve up to a 15% improvement in odd-one-out accuracy over a context-insensitive model. We find that this improvement is consistent across both original and "human-aligned" vision foundation models.
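
To make the task concrete, here is a minimal sketch of a triplet odd-one-out decision over fixed embeddings, with one *hypothetical* context-sensitive variant. The paper does not specify its similarity computation in this summary, so the reweighting scheme below (scaling embedding dimensions by the remaining item's feature magnitudes before taking cosine similarity) is purely illustrative, not the authors' method:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def odd_one_out(embs, context_sensitive=False):
    """Pick the odd item in a triplet of embeddings.

    The pair with the highest similarity is taken to 'belong together';
    the remaining item is the odd one out.
    """
    pairs = [(0, 1, 2), (0, 2, 1), (1, 2, 0)]  # (i, j, odd-one-out)
    best_sim, best_odd = -np.inf, None
    for i, j, k in pairs:
        u, v = embs[i], embs[j]
        if context_sensitive:
            # Hypothetical scheme (not from the paper): reweight each
            # embedding dimension by the third item's feature magnitudes,
            # so the comparison of i and j depends on the context k.
            w = np.abs(embs[k])
            u, v = u * w, v * w
        s = cosine(u, v)
        if s > best_sim:
            best_sim, best_odd = s, k
    return best_odd

# Toy triplet: the first two embeddings are nearly parallel, the third differs.
embs = [np.array([1.0, 0.0, 0.2]),
        np.array([0.9, 0.1, 0.1]),
        np.array([0.1, 1.0, 0.1])]
print(odd_one_out(embs))  # → 2
```

In the context-insensitive case the decision depends only on the fixed embeddings; flipping `context_sensitive=True` can change which pair is judged most similar, which is exactly the kind of effect a context-sensitive similarity is meant to model.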