Architecture Determines Observability in Transformers

arXiv cs.LG · April 29, 2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • Autoregressive transformers can make confident mistakes, and whether activation-based monitoring can detect those errors depends on whether the model’s architecture and training recipe preserve an internal signal not exposed by output confidence.
  • The study operationalizes “observability” as the linear readability of per-token decision quality from frozen mid-layer activations after controlling for max-softmax confidence and activation norm (see the sketch after this list), finding that these controls absorb 57.7% of the raw probe signal on average.
  • Observability is not uniform across transformer designs: in Pythia, a specific 24-layer/16-head configuration collapses to low partial correlation (~0.10) across parameter and dataset variants, while other configurations show a higher, healthier band (~0.21–0.38).
  • The collapse emerges during training rather than reflecting a signal that was never formed: the affected configurations build the signal at early checkpoints, and training then “erases” it even as predictive loss keeps improving.
  • Cross-model and cross-recipe results show persistent but architecture-dependent collapse patterns; a WikiText-trained “observer” probe still catches errors that confidence misses on downstream QA tasks it was never trained on, implying that architecture selection is a key monitoring decision.
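
To make the probe metric concrete, the sketch below trains a linear probe on frozen mid-layer activations to read per-token correctness, then computes the partial correlation between the probe score and correctness after regressing out max-softmax confidence and activation norm. It is an illustrative reconstruction, not the paper’s pipeline: the probe type (logistic regression), the single in-sample fit, and all names are assumptions.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression, LogisticRegression

def partial_observability(acts, correct, confidence, act_norm):
    """Illustrative sketch of output-controlled observability.

    acts       : (n_tokens, d_model) frozen mid-layer activations
    correct    : (n_tokens,) 1 if the model's next-token prediction was right
    confidence : (n_tokens,) max-softmax probability at each position
    act_norm   : (n_tokens,) L2 norm of each activation vector
    """
    # Linear probe: read per-token decision quality from frozen activations.
    # (In practice the probe would be fit on a train split and scored on
    # held-out tokens; a single in-sample fit keeps the sketch short.)
    probe = LogisticRegression(max_iter=1000).fit(acts, correct)
    score = probe.decision_function(acts)

    # Regress the controls (confidence, activation norm) out of both the
    # probe score and the correctness labels, then correlate the residuals.
    controls = np.column_stack([confidence, act_norm])
    res_score = score - LinearRegression().fit(controls, score).predict(controls)
    res_corr = correct - LinearRegression().fit(controls, correct).predict(controls)

    rho_partial, _ = pearsonr(res_score, res_corr)
    return rho_partial
```

Against the paper’s numbers, a healthy configuration would land around 0.21–0.38 on this readout and a collapsed one near 0.10.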

Abstract

Autoregressive transformers make confident errors, but activation monitoring can catch them only if the model preserves an internal signal that output confidence does not expose. This preservation is determined by architecture and training recipe. We define observability as the linear readability of per-token decision quality from frozen mid-layer activations after controlling for max-softmax confidence and activation norm. The correction is essential: confidence controls absorb 57.7% of the raw probe signal on average across 13 models in 6 families. Observability is not a generic property of transformers. In Pythia's controlled suite, every tested run with the 24-layer, 16-head configuration collapses to rho_partial ~0.10 across a 3.5x parameter gap and two Pile variants, while six other configurations occupy a separated healthy band from 0.21 to 0.38. The output-controlled residual collapses at the same points, and neither tested nonlinear probes nor layer sweeps recover healthy-range signal. Checkpoint dynamics show the collapse is emergent during training: both configurations at matched hidden dimension form the signal at the earliest measured checkpoint, but training erases it in the (24L, 16H) class while predictive loss continues improving. Across independent recipes the collapse map changes but the phenomenon persists. Qwen 2.5 and Llama differ in observability by 2.9x at matched 3B scale, with probe-seed distributions that do not overlap, while Mistral 7B preserves observability where Llama 3.1 8B collapses despite broadly similar architecture. A WikiText-trained observer transfers to downstream QA without training on those tasks, catching errors that confidence misses. At a 20% flag rate, its exclusive catch rate is 10.9–13.4% of all errors in seven of nine model-task cells. Architecture selection is a monitoring decision.
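
The transfer result at the end rests on a fixed flag budget: flag the 20% of tokens the observer scores as most suspicious, flag the 20% the model is least confident about, and count the errors only the observer catches as a share of all errors. A minimal sketch of that accounting follows, assuming this reading of “exclusive catch rate”; the function and argument names are hypothetical.

```python
import numpy as np

def exclusive_catch_rate(observer_score, confidence, is_error, flag_rate=0.20):
    """Share of all errors flagged by the observer but not by a confidence
    baseline operating at the same flag budget (illustrative sketch)."""
    n = len(is_error)
    n_flag = int(flag_rate * n)

    # Observer flags the tokens it scores as most likely to be wrong.
    obs_flags = np.zeros(n, dtype=bool)
    obs_flags[np.argsort(-observer_score)[:n_flag]] = True

    # Confidence baseline flags the least-confident tokens at the same budget.
    conf_flags = np.zeros(n, dtype=bool)
    conf_flags[np.argsort(confidence)[:n_flag]] = True

    caught_only_by_observer = is_error & obs_flags & ~conf_flags
    return caught_only_by_observer.sum() / max(is_error.sum(), 1)
```

Under that reading, the reported 10.9–13.4% means roughly one in eight errors in those cells is flagged only through the activations, not by confidence.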