DDCL-INCRT: A Self-Organising Transformer with Hierarchical Prototype Structure (Theoretical Foundations)

arXiv cs.LG / 4/3/2026


Key Points

  • The paper argues that standard transformer design forces practitioners to pre-select the architecture size (e.g., number of attention heads, depth, width) before training, which often leads to systematically oversized models that can be pruned afterwards without losing performance.
  • It proposes DDCL-INCRT, a self-organising transformer that learns its own structure during training by combining DDCL (prototype-based deep dual competitive learning for feedforward blocks) with INCRT (incremental head growth).
  • DDCL uses a dictionary of learned prototype vectors that spread apart automatically under the training objective, while INCRT starts with one attention head and adds new heads only when the directional information not yet covered by existing heads surpasses a threshold (see the sketches after this list and the abstract).
  • Theoretical results show that prototype separation and incremental head addition reinforce one another, producing a hierarchy of heads ordered by representational granularity and yielding an architecture that is provably unique and minimal for the task under the stated assumptions.
  • The authors provide formal guarantees for stability, convergence, and pruning safety, aiming to replace manual architecture design with a derivation-from-training approach.
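
To make the DDCL idea concrete, here is a minimal PyTorch sketch of a prototype-dictionary block standing in for the feedforward sublayer. This is an illustration under assumptions, not the authors' construction: the class name PrototypeBlock, the cosine-similarity matching, the softmax temperature, and the residual mixing are choices made here for readability; the paper's point is only that the prototypes separate under the task loss without an explicit spreading regulariser.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PrototypeBlock(nn.Module):
    """Illustrative stand-in for a feedforward block: each token is softly
    assigned to a dictionary of learned prototype vectors and reconstructed
    from them. No explicit spreading regulariser is applied; any separation
    of the prototypes would have to come from the task loss alone."""

    def __init__(self, d_model: int, n_prototypes: int, temperature: float = 0.1):
        super().__init__()
        # Dictionary of prototype vectors, one row per prototype (hypothetical init scale).
        self.prototypes = nn.Parameter(torch.randn(n_prototypes, d_model) * 0.02)
        self.temperature = temperature

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model). Cosine similarity between tokens and prototypes.
        sims = F.normalize(x, dim=-1) @ F.normalize(self.prototypes, dim=-1).T
        # Soft competition: tokens are rebuilt from the prototypes they resemble most,
        # with a residual connection so gradients stay well-behaved.
        weights = F.softmax(sims / self.temperature, dim=-1)  # (batch, seq, n_prototypes)
        return x + weights @ self.prototypes


if __name__ == "__main__":
    block = PrototypeBlock(d_model=64, n_prototypes=16)
    tokens = torch.randn(2, 10, 64)
    print(block(tokens).shape)  # torch.Size([2, 10, 64])
```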

Abstract

Modern neural networks of the transformer family require the practitioner to decide, before training begins, how many attention heads to use, how deep the network should be, and how wide each component should be. These decisions are made without knowledge of the task, producing architectures that are systematically larger than necessary: empirical studies find that a substantial fraction of heads and layers can be removed after training without performance loss. This paper introduces DDCL-INCRT, an architecture that determines its own structure during training. Two complementary ideas are combined. The first, DDCL (Deep Dual Competitive Learning), replaces the feedforward block with a dictionary of learned prototype vectors representing the most informative directions in the data. The prototypes spread apart automatically, driven by the training objective, without explicit regularisation. The second, INCRT (Incremental Transformer), controls the number of heads: starting from one, it adds a new head only when the directional information uncaptured by existing heads exceeds a threshold. The main theoretical finding is that these two mechanisms reinforce each other: each new head amplifies prototype separation, which in turn raises the signal triggering the next addition. At convergence, the network self-organises into a hierarchy of heads ordered by representational granularity. This hierarchical structure is proved to be unique and minimal, the smallest architecture sufficient for the task, under the stated conditions. Formal guarantees of stability, convergence, and pruning safety are established throughout. The architecture is not something one designs. It is something one derives.
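
The INCRT growth rule can likewise be sketched as a simple trigger statistic. In the sketch below, the "directional information uncaptured by existing heads" is approximated by the variance share of the leading principal direction of the residual left after subtracting the existing heads' combined output; the SVD-based proxy, the threshold value, and the function names are assumptions made here for illustration and are not the paper's actual criterion.

```python
import torch


def uncovered_direction_energy(x: torch.Tensor, head_outputs: torch.Tensor) -> float:
    """Heuristic proxy for directional information not captured by existing heads:
    the share of residual variance carried by its leading principal direction."""
    # x: (tokens, d_model); head_outputs: combined output of existing heads, same shape.
    residual = x - head_outputs
    residual = residual - residual.mean(dim=0, keepdim=True)
    # Leading singular value of the residual marks the strongest unmodelled direction.
    s = torch.linalg.svdvals(residual)
    return (s[0] ** 2 / s.pow(2).sum()).item()


def maybe_grow_head(x: torch.Tensor, head_outputs: torch.Tensor,
                    threshold: float = 0.3) -> bool:
    """Add a new head only when the uncovered directional signal exceeds the threshold."""
    return uncovered_direction_energy(x, head_outputs) > threshold


if __name__ == "__main__":
    torch.manual_seed(0)
    d = 64
    direction = torch.randn(d)
    # Residual dominated by one unmodelled direction: growth should trigger.
    tokens = torch.randn(128, 1) * direction + 0.1 * torch.randn(128, d)
    covered = torch.zeros_like(tokens)  # pretend no existing head covers anything yet
    print(maybe_grow_head(tokens, covered))  # True: one strong uncovered direction
```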