BOSCH: Black-Box Binary Optimization for Short-Context Attention-Head Selection in LLMs
arXiv cs.CL / 4/8/2026
Key Points
- BOSCH is a training-free, black-box method for selecting which attention heads to convert to sliding-window attention (SWA) in an LLM, reducing KV-cache size and latency.
- The paper argues that existing hybridization approaches fall short: layer-level methods ignore head routing within layers, while static head rankings fail because head importance is entangled and can shift once the model is hybridized.
- BOSCH formulates head selection as a black-box binary optimization problem solved with Large Neighborhood Search, decomposed into three subproblems: detecting layer importance with small-budget probes, assigning an adaptive SWA ratio per layer, and optimizing head choices within grouped ratio buckets.
- Experiments on four LLMs (1.7B–30B parameters) across four SWA ratios show BOSCH beats both layer-level heuristics and six strong static head-selection baselines, especially at higher SWA ratios.
- Under continual pretraining, BOSCH more quickly and more fully recovers original long-context performance, and its selected heads meaningfully change across SWA ratios, highlighting the need for ratio-specific head selection.
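The three-stage search described in the bullets above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `bosch_sketch`, the proportional budget allocation, and the within-layer swap neighborhood are all assumptions standing in for the paper's actual probe design, ratio assignment, and neighborhood structure.

```python
import random

def bosch_sketch(num_layers, heads_per_layer, swa_ratio, score_fn,
                 iters=200, seed=0):
    """Illustrative BOSCH-style head selection (details assumed).

    score_fn(selection) -> float is the black-box objective: the quality
    of the model when the heads in `selection` (a set of (layer, head)
    pairs) are converted to sliding-window attention. Higher is better.
    """
    rng = random.Random(seed)
    budget = int(round(swa_ratio * num_layers * heads_per_layer))

    # 1) Small-budget probes: score converting each whole layer to SWA.
    layer_score = [score_fn({(l, h) for h in range(heads_per_layer)})
                   for l in range(num_layers)]

    # 2) Adaptive per-layer SWA ratio: allocate the head budget roughly
    #    in proportion to each layer's probe score (layers that tolerate
    #    SWA best receive more SWA heads).
    lo = min(layer_score)
    w = [s - lo + 1e-9 for s in layer_score]
    tot = sum(w)
    per_layer = [min(heads_per_layer, round(budget * wi / tot)) for wi in w]
    l = 0  # repair rounding drift so the total matches the budget
    while sum(per_layer) < budget:
        if per_layer[l % num_layers] < heads_per_layer:
            per_layer[l % num_layers] += 1
        l += 1
    while sum(per_layer) > budget:
        if per_layer[l % num_layers] > 0:
            per_layer[l % num_layers] -= 1
        l += 1

    # 3) Neighborhood search: within a layer, swap one selected head for
    #    an unselected one, keeping the move only if the score improves
    #    (per-layer ratios stay fixed, mimicking the "ratio buckets").
    selection = {(l, h) for l in range(num_layers)
                 for h in range(per_layer[l])}
    best = score_fn(selection)
    for _ in range(iters):
        l = rng.randrange(num_layers)
        ins = [h for h in range(heads_per_layer) if (l, h) in selection]
        outs = [h for h in range(heads_per_layer) if (l, h) not in selection]
        if not ins or not outs:
            continue
        cand = (selection - {(l, rng.choice(ins))}) | {(l, rng.choice(outs))}
        s = score_fn(cand)
        if s > best:
            selection, best = cand, s
    return selection, best

# Toy black-box objective (hypothetical): each head has a fixed
# "importance", and converting less important heads hurts less.
importance = {(l, h): ((3 * l + h) % 7) / 7.0
              for l in range(4) for h in range(4)}
score = lambda sel: -sum(importance[p] for p in sel)

selection, best = bosch_sketch(4, 4, 0.5, score, iters=100)
```

In a real setting, `score_fn` would run the hybridized model on a small probe set, which is what makes the search black-box and training-free; because step 3 only swaps heads within a layer, the per-layer SWA ratios chosen in step 2 are preserved throughout the search.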




