Spike Hijacking in Late-Interaction Retrieval

arXiv cs.LG / 4/8/2026


Key Points

  • Late-interaction retrieval models typically use hard MaxSim (winner-take-all) pooling to aggregate token/patch similarities, and the paper argues this can bias training dynamics structurally.
  • The study analyzes gradient routing in MaxSim-based retrieval and shows that MaxSim causes significantly higher patch-level gradient concentration than smoother aggregation methods like Top-k pooling or softmax.
  • In synthetic in-batch contrastive experiments, the authors find a sparsity–robustness tradeoff: sparse routing can improve early discrimination, but it makes MaxSim more sensitive to document length as the number of patches grows.
  • Document-length sweeps on a real-world multi-vector retrieval benchmark confirm that MaxSim degrades more sharply than mild smoothing alternatives, indicating brittleness linked to hard max pooling.
  • The work motivates replacing hard max pooling with more principled pooling/aggregation strategies to improve robustness in multi-vector late-interaction systems.
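To make the three aggregation rules concrete, here is a minimal numpy sketch (illustrative only, not the paper's implementation; the function names and the `k`/`tau` values are assumptions). Each takes a query-token × document-patch similarity matrix and pools over patches before summing over query tokens.

```python
import numpy as np

def maxsim_score(sim):
    """Hard MaxSim: each query token keeps only its single best patch (winner-take-all)."""
    return sim.max(axis=1).sum()

def topk_score(sim, k=2):
    """Top-k pooling: average each query token's k best patch similarities."""
    topk = np.sort(sim, axis=1)[:, -k:]
    return topk.mean(axis=1).sum()

def softmax_score(sim, tau=0.1):
    """Softmax aggregation: a temperature-weighted average over all patches.

    As tau -> 0 this approaches hard MaxSim; larger tau smooths across patches.
    """
    z = np.exp((sim - sim.max(axis=1, keepdims=True)) / tau)
    w = z / z.sum(axis=1, keepdims=True)
    return (w * sim).sum(axis=1).sum()

# Toy example: 2 query tokens, 3 document patches.
sim = np.array([[0.9, 0.1, 0.2],
                [0.3, 0.8, 0.4]])
maxsim_score(sim)  # → 1.7 (0.9 + 0.8)
```

Because the softmax score is a weighted mean over patches, it is always bounded above by the MaxSim score for the same similarity matrix; Top-k likewise lower-bounds MaxSim, which is the "mild smoothing" direction the paper compares against.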

Abstract

Late-interaction retrieval models rely on hard maximum similarity (MaxSim) to aggregate token-level similarities. Although effective, this winner-take-all pooling rule may structurally bias training dynamics. We provide a mechanistic study of gradient routing and robustness in MaxSim-based retrieval. In a controlled synthetic environment with in-batch contrastive training, we demonstrate that MaxSim induces significantly higher patch-level gradient concentration than smoother alternatives such as Top-k pooling and softmax aggregation. While sparse routing can improve early discrimination, it also increases sensitivity to document length: as the number of document patches grows, MaxSim degrades more sharply than mild smoothing variants. We corroborate these findings on a real-world multi-vector retrieval benchmark, where controlled document-length sweeps reveal similar brittleness under hard max pooling. Together, our results isolate pooling-induced gradient concentration as a structural property of late-interaction retrieval and highlight a sparsity-robustness tradeoff. These findings motivate principled alternatives to hard max pooling in multi-vector retrieval systems.
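The gradient-routing claim can be seen directly from the pooling rules: the gradient of a row-wise max puts all its mass on the argmax patch, while a softmax spreads it across every patch. A toy numpy sketch of this (hand-derived gradients under stated assumptions, not the paper's code; the concentration metric below is a per-row Herfindahl index chosen for illustration):

```python
import numpy as np

def maxsim_grad(sim):
    """Gradient of sum-of-row-max w.r.t. sim: one-hot on each row's argmax patch."""
    g = np.zeros_like(sim)
    g[np.arange(sim.shape[0]), sim.argmax(axis=1)] = 1.0
    return g

def softmax_grad(sim, tau=0.1):
    """Gradient of sum of tau*logsumexp(sim/tau) w.r.t. sim: the softmax weights.

    Every patch receives some gradient, with sharpness controlled by tau.
    """
    z = (sim - sim.max(axis=1, keepdims=True)) / tau
    w = np.exp(z)
    return w / w.sum(axis=1, keepdims=True)

def gradient_concentration(g):
    """Mean per-row Herfindahl index: 1.0 = all mass on one patch, 1/P = uniform."""
    return float((g ** 2).sum(axis=1).mean())

# 4 query tokens, 32 document patches.
sim = np.random.default_rng(0).normal(size=(4, 32))
gradient_concentration(maxsim_grad(sim))   # exactly 1.0: winner-take-all routing
gradient_concentration(softmax_grad(sim))  # < 1.0: gradient spread over patches
```

Both gradient matrices have rows summing to 1, so the comparison isolates how the same total gradient mass is distributed across patches, which is the structural property the paper measures.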