Efficiency Follows Global-Local Decoupling

arXiv cs.CV / 3/23/2026


Key Points

  • The paper proposes ConvNeur, a two-branch architecture that decouples global reasoning from local representation to improve efficiency in vision models.
  • One branch uses a lightweight neural memory to aggregate global context on a compact token set, while a locality-preserving branch handles fine-grained structure, with a learned gate modulating local features by global cues.
  • The design achieves subquadratic scaling with image size and reduces overhead relative to fully global attention while preserving local inductive priors.
  • Empirical results on classification, detection, and segmentation show ConvNeur matching or surpassing comparable methods at equal or lower compute, supporting the claim that efficiency follows global-local decoupling.
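The subquadratic-scaling claim can be made concrete with a back-of-envelope FLOP count. The sketch below assumes the memory branch cross-attends N image tokens to a fixed set of M memory tokens; the channel width, memory size, and resolutions are illustrative numbers, not values from the paper.

```python
# Illustrative cost comparison: full self-attention over N tokens costs
# roughly O(N^2 * d) multiply-adds, while cross-attending N tokens to a
# fixed set of M memory tokens costs O(N * M * d) -- linear in N once M
# is held constant, hence subquadratic in image size.

def full_attention_flops(n_tokens: int, dim: int) -> int:
    """QK^T plus attention-weighted V: roughly 2 * N^2 * d multiply-adds."""
    return 2 * n_tokens * n_tokens * dim

def memory_attention_flops(n_tokens: int, n_memory: int, dim: int) -> int:
    """Cross-attention from N tokens to M memory slots: roughly 2 * N * M * d."""
    return 2 * n_tokens * n_memory * dim

d, m = 256, 64  # hypothetical channel width and memory size
for side in (14, 28, 56):  # hypothetical feature-map resolutions
    n = side * side
    ratio = full_attention_flops(n, d) / memory_attention_flops(n, m, d)
    print(f"{side}x{side} ({n} tokens): full / memory cost = {ratio:.1f}x")
```

The ratio is simply N/M, so the savings grow linearly as the image (and hence N) grows while M stays fixed.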

Abstract

Modern vision models must capture image-level context without sacrificing local detail while remaining computationally affordable. We revisit this trade-off and advance a simple principle: decouple the roles of global reasoning and local representation. To operationalize this principle, we introduce ConvNeur, a two-branch architecture in which a lightweight neural-memory branch aggregates global context over a compact set of tokens and a locality-preserving branch extracts fine structure. A learned gate lets global cues modulate local features without entangling their objectives. This separation yields subquadratic scaling with image size, retains the inductive priors associated with local processing, and reduces overhead relative to fully global attention. On standard classification, detection, and segmentation benchmarks, ConvNeur matches or surpasses comparable alternatives at similar or lower compute, with favorable accuracy-latency trade-offs at matched budgets. These results support the view that efficiency follows global-local decoupling.
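The gated two-branch design described in the abstract can be sketched in a few lines of NumPy. Everything here is an illustrative assumption rather than the paper's actual operators: the memory slots are a small random matrix standing in for learned tokens, the local branch is a fixed 3x3 mean filter standing in for a learned depthwise convolution, and the gate is a per-token sigmoid of the global context.

```python
# Minimal sketch of a gated two-branch block in the spirit of ConvNeur.
# Shapes, weights, and operators are illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def global_branch(x, memory, w_q, w_k, w_v):
    """Aggregate global context by cross-attending N tokens to M memory slots
    (cost O(N*M*d) rather than O(N^2*d))."""
    q = x @ w_q                                       # (N, d)
    k = memory @ w_k                                  # (M, d)
    v = memory @ w_v                                  # (M, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))    # (N, M)
    return attn @ v                                   # (N, d)

def local_branch(x, h, w):
    """Locality-preserving path: fixed 3x3 mean filter per channel, a
    stand-in for a learned depthwise convolution."""
    fm = x.reshape(h, w, -1)
    padded = np.pad(fm, ((1, 1), (1, 1), (0, 0)), mode="edge")
    out = np.zeros_like(fm)
    for i in range(h):
        for j in range(w):
            out[i, j] = padded[i:i + 3, j:j + 3].mean(axis=(0, 1))
    return out.reshape(h * w, -1)

def gated_block(x, memory, w_q, w_k, w_v, w_g, h, w):
    """Global cues modulate local features through a sigmoid gate, so the
    two branches mix without entangling their objectives."""
    g = global_branch(x, memory, w_q, w_k, w_v)       # global context (N, d)
    local = local_branch(x, h, w)                     # fine structure (N, d)
    gate = 1.0 / (1.0 + np.exp(-(g @ w_g)))           # values in (0, 1)
    return x + gate * local                           # gated residual update

h = w = 8
d, m = 16, 4
x = rng.standard_normal((h * w, d))                   # N = 64 image tokens
memory = rng.standard_normal((m, d))                  # compact token set
w_q, w_k, w_v, w_g = (rng.standard_normal((d, d)) * 0.1 for _ in range(4))
y = gated_block(x, memory, w_q, w_k, w_v, w_g, h, w)
print(y.shape)  # same shape as the input token grid
```

The gated residual form means the block degrades gracefully: when the gate saturates near zero, the block is an identity map, and when it opens, global context decides where local detail is injected.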