AI Navigate

Surgical Repair of Collapsed Attention Heads in ALiBi Transformers

arXiv cs.CL / 3/11/2026

Ideas & Deep Analysis | Models & Research

Key Points

  • The paper identifies a systematic attention collapse in the BLOOM family of transformer language models, which use ALiBi positional encoding: 31-44% of attention heads focus almost entirely on the beginning-of-sequence token (a diagnostic sketch for spotting such heads follows this list).
  • This attention collapse appears consistently across different model scales (560M to 7.1B parameters) and is linked to ALiBi's slope schedule imposing steep distance penalties.
  • The authors propose surgical reinitialization: targeted Q/K/V reinitialization with zeroed output projections and gradient-masked freezing of all other parameters, restoring BLOOM-1b7 to 98.7% operational head capacity on a single consumer GPU.
  • A controlled comparison shows that reinitialization, rather than training corpus content, drives recovery, and it reveals two distinct post-surgical phenomena: early functional improvement and late-stage degradation that accumulates under a noisy training signal.
  • Reinitializing mostly healthy heads alongside collapsed ones produces a model that transiently outperforms the original by 25% on training perplexity, suggesting that pretrained attention configurations may be suboptimal local minima; code, checkpoints, and diagnostic tools are released as open source.
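
Below is a minimal diagnostic sketch in the spirit of the paper's collapse detection: it loads a BLOOM checkpoint via Hugging Face Transformers and flags heads whose attention mass concentrates on the first (beginning-of-sequence) token position. The checkpoint name, probe text, and the 0.9 collapse threshold are illustrative assumptions, not the paper's exact protocol.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Smallest BLOOM scale studied in the paper; any BLOOM checkpoint should work here.
model_name = "bigscience/bloom-560m"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_attentions=True)
model.eval()

text = "The quick brown fox jumps over the lazy dog. " * 8   # illustrative probe text
inputs = tok(text, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs)

collapsed = []
for layer, attn in enumerate(out.attentions):        # attn: (batch, heads, query, key)
    # Average attention mass each head places on the first token, skipping the
    # first query position, which can only attend to itself under causal masking.
    first_token_mass = attn[0, :, 1:, 0].mean(dim=-1)  # shape: (heads,)
    for head, mass in enumerate(first_token_mass.tolist()):
        if mass > 0.9:                                # illustrative collapse threshold
            collapsed.append((layer, head, round(mass, 3)))

print(f"{len(collapsed)} heads exceed the collapse threshold")
for layer, head, mass in collapsed[:10]:
    print(f"layer {layer:2d} head {head:2d}: first-token mass {mass}")
```

A real audit would average these statistics over many held-out sequences and lengths before labeling a head as collapsed; a single probe sentence is only enough to illustrate the measurement.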


arXiv:2603.09616 (cs)
[Submitted on 10 Mar 2026]

Title: Surgical Repair of Collapsed Attention Heads in ALiBi Transformers

Abstract: We identify a systematic attention collapse pathology in the BLOOM family of transformer language models, where ALiBi positional encoding causes 31-44% of attention heads to attend almost entirely to the beginning-of-sequence token. The collapse follows a predictable pattern across four model scales (560M to 7.1B parameters), concentrating in head indices where ALiBi's slope schedule imposes the steepest distance penalties. We introduce surgical reinitialization: targeted Q/K/V reinitialization with zeroed output projections and gradient-masked freezing of all non-surgical parameters. Applied to BLOOM-1b7 on a single consumer GPU, the technique recovers 98.7% operational head capacity (242 to 379 of 384 heads) in two passes. A controlled comparison with C4 training data confirms that reinitialization -- not corpus content -- drives recovery, and reveals two distinct post-surgical phenomena: early global functional redistribution that improves the model, and late local degradation that accumulates under noisy training signal. An extended experiment reinitializing mostly-healthy heads alongside collapsed ones produces a model that transiently outperforms stock BLOOM-1b7 by 25% on training perplexity (12.70 vs. 16.99), suggesting that pretrained attention configurations are suboptimal local minima. Code, checkpoints, and diagnostic tools are released as open-source software.
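
The core procedure lends itself to a short sketch. The following is a hedged illustration of the surgical step on a Hugging Face BLOOM checkpoint: reinitialize the fused Q/K/V rows of targeted heads, zero their output-projection columns, and register gradient masks so that only those slices receive updates during the repair pass. The checkpoint name, target head indices, and initialization scale are assumptions for illustration; the slicing follows BLOOM's fused query_key_value layout (per head: q, k, v blocks of head_dim rows each).

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")
cfg = model.config
head_dim = cfg.hidden_size // cfg.n_head

# Hypothetical targets: {layer index: [collapsed head indices]} from a prior diagnostic pass.
targets = {3: [0, 1], 7: [5]}

# Gradient-masked freezing: freeze everything, then re-enable only the surgical weights.
for p in model.parameters():
    p.requires_grad_(False)

def head_rows(head):
    # Rows of BLOOM's fused query_key_value weight belonging to one head (q, k, v stacked).
    return slice(head * 3 * head_dim, (head + 1) * 3 * head_dim)

for layer_idx, heads in targets.items():
    attn = model.transformer.h[layer_idx].self_attention
    qkv, dense = attn.query_key_value, attn.dense

    qkv_mask = torch.zeros_like(qkv.weight)
    dense_mask = torch.zeros_like(dense.weight)
    with torch.no_grad():
        for h in heads:
            rows = head_rows(h)
            # Reinitialize Q/K/V for the collapsed head (0.02 is an illustrative init scale).
            torch.nn.init.normal_(qkv.weight[rows], std=0.02)
            qkv.bias[rows].zero_()
            # Zero the output-projection columns so the repaired head starts silent.
            dense.weight[:, h * head_dim:(h + 1) * head_dim].zero_()
            qkv_mask[rows] = 1.0
            dense_mask[:, h * head_dim:(h + 1) * head_dim] = 1.0

    qkv.weight.requires_grad_(True)
    dense.weight.requires_grad_(True)
    # Mask gradients so non-surgical rows/columns of these layers also stay frozen.
    qkv.weight.register_hook(lambda g, m=qkv_mask: g * m)
    dense.weight.register_hook(lambda g, m=dense_mask: g * m)
```

After this setup, a brief fine-tuning pass (the paper reports recovery in two passes on a single consumer GPU) would update only the reinitialized head slices; everything else is held fixed by the frozen parameters and the gradient masks.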
Subjects: Computation and Language (cs.CL)
Cite as: arXiv:2603.09616 [cs.CL]
  (or arXiv:2603.09616v1 [cs.CL] for this version)
  https://doi.org/10.48550/arXiv.2603.09616

Submission history

From: Jason Schallon
[v1] Tue, 10 Mar 2026 12:57:49 UTC (1,258 KB)