Looking Beyond the Window: Global-Local Aligned CLIP for Training-free Open-Vocabulary Semantic Segmentation
arXiv cs.CV / 3/25/2026
Key Points
- The paper identifies a limitation in training-free open-vocabulary semantic segmentation methods that use sliding-window inference: independent window processing causes semantic discrepancies across windows.
- It proposes Global-Local Aligned CLIP (GLA-CLIP), which extends CLIP key-value tokens to enable information exchange across all windows instead of restricting attention to local window tokens.
- The authors address a “window bias” problem where outer-window tokens receive less attention by introducing a proxy anchor that aggregates highly query-relevant tokens from all windows as a unified semantic reference.
- To improve robustness for small objects, GLA-CLIP adds a dynamic normalization scheme that scales and thresholds attention based on object scale.
- GLA-CLIP is reported to work as a plug-in enhancement that broadens the receptive field of existing approaches; the claims are supported by extensive experiments, and code has been released.
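The cross-window attention and proxy-anchor ideas described above can be sketched as follows. This is a minimal numpy illustration under stated assumptions, not the authors' implementation: the function names, the top-k pooling of query-relevant tokens into the anchor, and the mean-query similarity heuristic are all assumptions made for the sake of a runnable example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def global_local_attention(queries, window_keys, window_values, top_k=4):
    """Illustrative cross-window attention with a proxy anchor (hypothetical).

    queries:       (q, d) tokens of the current window
    window_keys:   list of (n_i, d) key arrays, one per sliding window
    window_values: list of (n_i, d) value arrays, one per sliding window
    """
    # Pool keys/values from ALL windows so local queries can exchange
    # information globally instead of attending only within their window.
    keys = np.concatenate(window_keys, axis=0)      # (N, d)
    values = np.concatenate(window_values, axis=0)  # (N, d)

    # Proxy anchor (assumed construction): average the top-k keys most
    # similar to the mean query, giving every window a shared semantic
    # reference to counter the bias against outer-window tokens.
    mean_q = queries.mean(axis=0)
    top = np.argsort(keys @ mean_q)[-top_k:]
    anchor_k = keys[top].mean(axis=0, keepdims=True)
    anchor_v = values[top].mean(axis=0, keepdims=True)

    keys = np.concatenate([keys, anchor_k], axis=0)
    values = np.concatenate([values, anchor_v], axis=0)

    # Scaled dot-product attention over the globally pooled tokens.
    attn = softmax(queries @ keys.T / np.sqrt(queries.shape[1]))
    return attn @ values

rng = np.random.default_rng(0)
out = global_local_attention(
    rng.normal(size=(3, 8)),                                    # queries
    [rng.normal(size=(5, 8)), rng.normal(size=(4, 8))],         # keys per window
    [rng.normal(size=(5, 8)), rng.normal(size=(4, 8))],         # values per window
    top_k=2,
)
```

The output retains the per-query shape `(3, 8)`; the dynamic, scale-dependent normalization for small objects mentioned in the key points is omitted here, since the paper's exact scaling and thresholding rules are not described in this summary.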