BOSCH: Black-Box Binary Optimization for Short-Context Attention-Head Selection in LLMs

arXiv cs.CL / 4/8/2026


Key Points

  • BOSCH is a training-free, black-box method for selecting attention heads when converting an LLM’s attention mechanism to sliding-window attention (SWA) for reduced KV-cache and latency.
  • The paper argues that existing hybridization approaches are limited: layer-level methods ignore head routing within layers, while static head rankings suffer from entanglement, since a head's local/global behavior can shift after hybridization and invalidate the ranking.
  • BOSCH formulates head selection as Large Neighborhood Search and solves it via three subproblems: detecting layer importance with small-budget probes, assigning an adaptive SWA ratio per layer, and optimizing head choices within grouped ratio buckets.
  • Experiments on four LLMs (1.7B–30B parameters) across four SWA ratios show BOSCH beats both layer-level heuristics and six strong static head-selection baselines, especially at higher SWA ratios.
  • Under continual pretraining, BOSCH recovers the original long-context performance faster and more completely, and its selected heads change substantially across SWA ratios, highlighting the need for ratio-specific head selection.
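The three-stage decomposition above can be illustrated with a toy sketch. All names here are illustrative: `score` is a hypothetical stand-in for the black-box evaluator (BOSCH would query the actual model on short-context probes), and the greedy per-layer head pick is a simplification of the paper's grouped large-neighborhood search, not the method itself.

```python
import random

random.seed(0)
N_LAYERS, HEADS_PER_LAYER = 4, 8

# Hypothetical per-head cost of converting to SWA (unknown to the optimizer,
# only observable through the black-box score below).
TRUE_COST = {(l, h): random.random() * (1 + l)
             for l in range(N_LAYERS) for h in range(HEADS_PER_LAYER)}

def score(config):
    """Black-box score of a config {layer: set_of_swa_heads}; higher is better."""
    return -sum(TRUE_COST[(l, h)] for l, heads in config.items() for h in heads)

def probe_layer_importance(budget=2):
    """Stage (i): small-budget probes -- flip `budget` random heads in one
    layer to SWA and record the score drop as that layer's sensitivity."""
    base = score({})
    sens = []
    for l in range(N_LAYERS):
        heads = random.sample(range(HEADS_PER_LAYER), budget)
        sens.append(base - score({l: heads}))
    return sens

def assign_ratios(sens, global_ratio):
    """Stage (ii): adaptive per-layer SWA ratios -- less sensitive layers
    absorb more SWA heads while the overall ratio stays fixed."""
    inv = [1.0 / (s + 1e-9) for s in sens]
    total_swa = round(global_ratio * N_LAYERS * HEADS_PER_LAYER)
    counts = [min(HEADS_PER_LAYER, round(total_swa * w / sum(inv))) for w in inv]
    # fix rounding drift so the global budget is met exactly
    while sum(counts) < total_swa:
        i = min(range(N_LAYERS),
                key=lambda j: (counts[j] >= HEADS_PER_LAYER, counts[j]))
        counts[i] += 1
    while sum(counts) > total_swa:
        counts[max(range(N_LAYERS), key=lambda j: counts[j])] -= 1
    return counts

def select_heads(counts):
    """Stage (iii): within each layer, pick the heads that are cheapest to
    convert under the black-box score (greedy stand-in for the grouped search)."""
    config = {}
    for l, k in enumerate(counts):
        ranked = sorted(range(HEADS_PER_LAYER),
                        key=lambda h: score({l: [h]}), reverse=True)
        config[l] = set(ranked[:k])
    return config

sens = probe_layer_importance()
counts = assign_ratios(sens, global_ratio=0.5)
config = select_heads(counts)
```

The point of the decomposition is that each stage shrinks the search space for the next: layer probes cost only a handful of black-box queries, and head-level search then runs within each layer's fixed budget instead of over all layer-head combinations at once.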

Abstract

Post-training hybridization of large language models (LLMs) often replaces quadratic self-attention with sliding-window attention (SWA) to reduce KV cache usage and improve latency. Existing hybridization schemes are typically defined either at the layer level (e.g., interleaving) or at the head level via static rankings from local to global. Layer-level schemes ignore that local and global dependencies are routed through heads within the same layer, while static head-level rankings suffer from entanglement: a head's local/global behavior can change after hybridization. We propose BOSCH, Black-box Binary Optimization for Short-context Head Selection, a training-free method that formulates the problem as a Large Neighborhood Search and decomposes it into three subproblems: (i) layer-importance detection via small-budget black-box probes, (ii) adaptive per-layer SWA-ratio assignment based on these sensitivities, and (iii) grouped head-level optimization within ratio buckets. Extensive experiments on 4 LLMs ranging from 1.7B to 30B parameters, across 4 SWA ratios, show that BOSCH consistently outperforms layer-level heuristics and 6 strong static head-level methods, with larger gains at higher SWA ratios. Under continual pretraining, BOSCH recovers original long-context performance faster and to a higher level. Analysis of the selected heads reveals substantial turnover for BOSCH across different SWA ratios, underscoring the importance of performing head-level selection for each target ratio rather than relying on fixed locality rankings.
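To see why swapping full attention for SWA cuts KV-cache usage, consider the attention pattern itself. A minimal sketch (standard sliding-window masking, not code from the paper): under a causal mask every token keeps all past keys/values, whereas a window of size `w` caps the cache at `w` entries per head.

```python
def kv_cache_sizes(seq_len, window):
    """Per-position KV-cache occupancy: full causal attention stores every
    past key/value; sliding-window attention stores at most `window`."""
    full = [i + 1 for i in range(seq_len)]
    swa = [min(i + 1, window) for i in range(seq_len)]
    return full, swa

def swa_mask(seq_len, window):
    """Causal sliding-window mask: token i attends to j iff 0 <= i - j < window."""
    return [[1 if 0 <= i - j < window else 0 for j in range(seq_len)]
            for i in range(seq_len)]

full, swa = kv_cache_sizes(8, 3)   # full grows linearly, SWA plateaus at 3
mask = swa_mask(5, 2)              # each row has at most 2 visible positions
```

At long sequence lengths the gap dominates: for a 32K context and a 4K window, SWA heads hold 8x fewer cached entries, which is exactly the saving that makes deciding *which* heads to convert worth optimizing.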