AI Navigate

Surgical Repair of Collapsed Attention Heads in ALiBi Transformers

arXiv cs.CL / 3/11/2026

Ideas & Deep Analysis | Models & Research

Key Points

  • The paper identifies a systematic attention collapse in the BLOOM family of transformer language models, which use ALiBi positional encoding: 31-44% of attention heads focus almost entirely on the beginning-of-sequence token (a diagnostic sketch for spotting such heads follows this list).
  • This attention collapse appears consistently across different model scales (560M to 7.1B parameters) and is linked to ALiBi's slope schedule imposing steep distance penalties.
  • The authors propose surgical reinitialization: targeted Q/K/V reinitialization with zeroed output projections and gradient-masked freezing of all other parameters, restoring BLOOM-1b7 to 98.7% operational head capacity on a single consumer GPU.
  • A controlled comparison shows that reinitialization, rather than training corpus content, drives recovery, and it reveals two distinct post-surgical phenomena: early functional improvement and late-stage degradation that accumulates under a noisy training signal.
  • Reinitializing mostly healthy heads alongside collapsed ones produces a model that transiently outperforms the original by 25% on training perplexity, suggesting that pretrained attention configurations may be suboptimal local minima; code, checkpoints, and diagnostic tools are released as open source.
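
Below is a minimal diagnostic sketch in the spirit of the paper's collapse detection: it loads a BLOOM checkpoint via Hugging Face Transformers and flags heads whose attention mass concentrates on the first (beginning-of-sequence) token position. The checkpoint name, probe text, and the 0.9 collapse threshold are illustrative assumptions, not the paper's exact protocol.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Smallest BLOOM scale studied in the paper; any BLOOM checkpoint should work here.
model_name = "bigscience/bloom-560m"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_attentions=True)
model.eval()

text = "The quick brown fox jumps over the lazy dog. " * 8   # illustrative probe text
inputs = tok(text, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs)

collapsed = []
for layer, attn in enumerate(out.attentions):        # attn: (batch, heads, query, key)
    # Average attention mass each head places on the first token, skipping the
    # first query position, which can only attend to itself under causal masking.
    first_token_mass = attn[0, :, 1:, 0].mean(dim=-1)  # shape: (heads,)
    for head, mass in enumerate(first_token_mass.tolist()):
        if mass > 0.9:                                # illustrative collapse threshold
            collapsed.append((layer, head, round(mass, 3)))

print(f"{len(collapsed)} heads exceed the collapse threshold")
for layer, head, mass in collapsed[:10]:
    print(f"layer {layer:2d} head {head:2d}: first-token mass {mass}")
```

A real audit would average these statistics over many held-out sequences and lengths before labeling a head as collapsed; a single probe sentence is only enough to illustrate the measurement.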


arXiv:2603.09616 (cs)
[Submitted on 10 Mar 2026]

Title: Surgical Repair of Collapsed Attention Heads in ALiBi Transformers

Abstract: We identify a systematic attention collapse pathology in the BLOOM family of transformer language models, where ALiBi positional encoding causes 31-44% of attention heads to attend almost entirely to the beginning-of-sequence token. The collapse follows a predictable pattern across four model scales (560M to 7.1B parameters), concentrating in head indices where ALiBi's slope schedule imposes the steepest distance penalties. We introduce surgical reinitialization: targeted Q/K/V reinitialization with zeroed output projections and gradient-masked freezing of all non-surgical parameters. Applied to BLOOM-1b7 on a single consumer GPU, the technique recovers 98.7% operational head capacity (242 to 379 of 384 heads) in two passes. A controlled comparison with C4 training data confirms that reinitialization -- not corpus content -- drives recovery, and reveals two distinct post-surgical phenomena: early global functional redistribution that improves the model, and late local degradation that accumulates under noisy training signal. An extended experiment reinitializing mostly-healthy heads alongside collapsed ones produces a model that transiently outperforms stock BLOOM-1b7 by 25% on training perplexity (12.70 vs. 16.99), suggesting that pretrained attention configurations are suboptimal local minima. Code, checkpoints, and diagnostic tools are released as open-source software.
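
The core procedure lends itself to a short sketch. The following is a hedged illustration of the surgical step on a Hugging Face BLOOM checkpoint: reinitialize the fused Q/K/V rows of targeted heads, zero their output-projection columns, and register gradient masks so that only those slices receive updates during the repair pass. The checkpoint name, target head indices, and initialization scale are assumptions for illustration; the slicing follows BLOOM's fused query_key_value layout (per head: q, k, v blocks of head_dim rows each).

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")
cfg = model.config
head_dim = cfg.hidden_size // cfg.n_head

# Hypothetical targets: {layer index: [collapsed head indices]} from a prior diagnostic pass.
targets = {3: [0, 1], 7: [5]}

# Gradient-masked freezing: freeze everything, then re-enable only the surgical weights.
for p in model.parameters():
    p.requires_grad_(False)

def head_rows(head):
    # Rows of BLOOM's fused query_key_value weight belonging to one head (q, k, v stacked).
    return slice(head * 3 * head_dim, (head + 1) * 3 * head_dim)

for layer_idx, heads in targets.items():
    attn = model.transformer.h[layer_idx].self_attention
    qkv, dense = attn.query_key_value, attn.dense

    qkv_mask = torch.zeros_like(qkv.weight)
    dense_mask = torch.zeros_like(dense.weight)
    with torch.no_grad():
        for h in heads:
            rows = head_rows(h)
            # Reinitialize Q/K/V for the collapsed head (0.02 is an illustrative init scale).
            torch.nn.init.normal_(qkv.weight[rows], std=0.02)
            qkv.bias[rows].zero_()
            # Zero the output-projection columns so the repaired head starts silent.
            dense.weight[:, h * head_dim:(h + 1) * head_dim].zero_()
            qkv_mask[rows] = 1.0
            dense_mask[:, h * head_dim:(h + 1) * head_dim] = 1.0

    qkv.weight.requires_grad_(True)
    dense.weight.requires_grad_(True)
    # Mask gradients so non-surgical rows/columns of these layers also stay frozen.
    qkv.weight.register_hook(lambda g, m=qkv_mask: g * m)
    dense.weight.register_hook(lambda g, m=dense_mask: g * m)
```

After this setup, a brief fine-tuning pass (the paper reports recovery in two passes on a single consumer GPU) would update only the reinitialized head slices; everything else is held fixed by the frozen parameters and the gradient masks.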
Subjects: Computation and Language (cs.CL)
Cite as: arXiv:2603.09616 [cs.CL]
  (or arXiv:2603.09616v1 [cs.CL] for this version)
  https://doi.org/10.48550/arXiv.2603.09616

Submission history

From: Jason Schallon
[v1] Tue, 10 Mar 2026 12:57:49 UTC (1,258 KB)