INCRT: An Incremental Transformer That Determines Its Own Architecture

arXiv cs.LG / 4/14/2026


Key Points

  • The paper proposes INCRT (Incremental Transformer), which incrementally adds and prunes attention heads during training instead of fixing the transformer architecture before learning.
  • INCRT starts with a single head and grows the model only when its current structure is provably insufficient, while pruning heads shown to be redundant, guided by an online-computable geometric metric.
  • Two theoretical results are presented: homeostatic convergence to a finite minimal-and-sufficient stopping configuration, and a compressed-sensing-inspired bound relating final head count to the task’s spectral complexity.
  • Experiments on SARS-CoV-2 variant classification and SST-2 sentiment analysis show that predicted head counts agree with observed counts to within ~12%, and the resulting architectures match or exceed BERT-base on task-specific benchmarks while using 3–7× fewer parameters and no pre-training.
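
The grow/prune cycle described above can be sketched as a single decision step. This is a minimal illustration, not the paper's implementation: the function name `incrt_step`, the two thresholds, and the scalar `uncaptured_energy` / per-head `redundancy` scores are stand-ins for the online geometric metric the paper derives.

```python
def incrt_step(heads, uncaptured_energy, redundancy,
               grow_threshold=0.05, prune_threshold=0.01):
    """One hypothetical INCRT growth/pruning decision.

    heads             -- list of head identifiers currently in the model
    uncaptured_energy -- fraction of directional energy no head captures
    redundancy        -- dict mapping each head id to a redundancy score
    """
    # Prune: drop any head whose redundancy score marks it as removable.
    survivors = [h for h in heads if redundancy[h] < prune_threshold]
    # Grow: add exactly one head, and only when the current configuration
    # leaves uncaptured directional energy above the threshold.
    if uncaptured_energy > grow_threshold:
        survivors.append(max(heads, default=0) + 1)
    return survivors
```

At the stopping configuration both conditions fail, so the head list is a fixed point of `incrt_step` — which is the behavior the homeostatic-convergence theorem formalizes.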

Abstract

Transformer architectures are designed by trial and error: the number of attention heads, the depth, and the head size are fixed before training begins, with no mathematical principle to guide the choice. The result is systematic structural redundancy -- between half and four-fifths of all heads in a trained model can be removed without measurable loss -- because the architecture allocates capacity without reference to the actual requirements of the task. This paper introduces INCRT (Incremental Transformer), an architecture that determines its own structure during training. Starting from a single head, INCRT adds one attention head at a time whenever its current configuration is provably insufficient, and prunes heads that have become redundant. Each growth decision is driven by a single, online-computable geometric quantity derived from the task's directional structure, requiring no separate validation phase and no hand-tuned schedule. Two theorems form the theoretical backbone. The first (homeostatic convergence) establishes that the system always reaches a finite stopping configuration that is simultaneously minimal (no redundant heads) and sufficient (no uncaptured directional energy above the threshold). The second (compressed-sensing analogy) provides a geometric upper bound on the number of heads that this configuration can contain, as a function of the spectral complexity of the task. Experiments on SARS-CoV-2 variant classification and SST-2 sentiment analysis confirm both results: the predicted and observed head counts agree within 12% across all benchmarks, and the final architectures match or exceed BERT-base on distribution-specific tasks while using between three and seven times fewer parameters and no pre-training.
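The second theorem ties the final head count to the task's spectral complexity. The abstract does not give the bound's exact form, so the following is only a rough proxy under my own assumption: treat "spectral complexity" as the number of eigenvalues of a task-derived directional covariance matrix that carry a non-negligible share of the total spectral energy. The function name `spectral_head_bound` and the threshold are hypothetical.

```python
import numpy as np

def spectral_head_bound(cov, energy_threshold=0.01):
    """Count spectral directions carrying a non-negligible energy share.

    cov -- symmetric covariance matrix summarizing the task's
           directional structure (an assumed stand-in for the paper's
           geometric quantity).
    Returns the number of eigenvalues above `energy_threshold` times
    the total spectral energy -- a crude proxy for an upper bound on
    the number of heads the stopping configuration can contain.
    """
    eigvals = np.linalg.eigvalsh(cov)  # real eigenvalues, ascending order
    total = eigvals.sum()
    return int((eigvals > energy_threshold * total).sum())
```

For example, a task whose directional covariance is effectively rank-2 (two dominant eigenvalues, the rest near zero) would yield a bound of 2 heads under this proxy.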