Cross-Entropy Is Load-Bearing: A Pre-Registered Scope Test of the K-Way Energy Probe on Bidirectional Predictive Coding

arXiv cs.CL / April 24, 2026

💬 Opinion · Models & Research

Key Points

  • A pre-registered study tests an earlier claim that the K-way energy probe in predictive coding networks reduces to a monotone function of the log-softmax margin, focusing on how sensitive the reduction is to removing cross-entropy (CE).
  • When standard predictive coding is trained without CE (using MSE instead), the probe no longer matches the softmax-based behavior: the probe remains below softmax with a statistically significant negative gap across 10 CIFAR-10 seeds.
  • In bidirectional predictive coding (bPC), the probe exceeds softmax for all seeds, but the study’s manipulation check finds bPC does not meaningfully increase latent movement at the matched scale.
  • Removing CE alone roughly halves the probe–softmax gap, indicating CE is a key “load-bearing” component; CE training also yields much larger output logit norms than MSE or bPC training.
  • Temperature-scaling ablations further decompose the effect: about 66% of the probe–softmax gap comes from logit-scale effects that temperature rescaling can remove, while about 34% reflects a scale-invariant ranking advantage from CE-trained representations.
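The scale-vs-ranking split in the last bullet hinges on a simple identity: differences of log-softmax outputs equal differences of raw logits, so dividing logits by a temperature T shrinks the margin by exactly 1/T while leaving the class ranking untouched. A minimal sketch of that identity (our own illustration with hypothetical numbers, not the paper's code):

```python
import math

def log_softmax(z):
    """Numerically stable log-softmax."""
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - lse for v in z]

def margin(z):
    """Gap between the largest and second-largest entries."""
    top = sorted(z, reverse=True)
    return top[0] - top[1]

logits = [6.0, 2.0, -1.0]   # hypothetical logits with a CE-style large norm
T = 15.0                    # hypothetical temperature (the paper reports ~15x norms)

lp = log_softmax(logits)
lp_T = log_softmax([v / T for v in logits])

# The log-softmax margin equals the raw logit margin (the log-sum-exp cancels)...
assert abs(margin(lp) - margin(logits)) < 1e-9
# ...so temperature rescaling shrinks it by exactly 1/T (a pure scale effect)...
assert abs(margin(lp_T) - margin(logits) / T) < 1e-9
# ...while the argmax, i.e. the ranking, is unchanged (the scale-invariant part).
assert lp.index(max(lp)) == lp_T.index(max(lp_T))
```

This is why a temperature ablation can cleanly separate logit-scale effects from any residual, scale-invariant ranking advantage.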

Abstract

Cacioli (2026) showed that the K-way energy probe on standard discriminative predictive coding networks reduces approximately to a monotone function of the log-softmax margin. The reduction rests on five assumptions, including cross-entropy (CE) at the output and effectively feedforward inference dynamics. This pre-registered study tests the reduction's sensitivity to CE removal using two conditions: standard PC trained with MSE instead of CE, and bidirectional PC (bPC; Oliviers, Tang & Bogacz, 2025). Across 10 seeds on CIFAR-10 with a matched 2.1M-parameter backbone, we find three results. First, the negative result replicates on standard PC: the probe sits below softmax (Delta = -0.082, p < 10^-6). Second, on bPC the probe exceeds softmax across all 10 seeds (Delta = +0.008, p = 0.000027), though a pre-registered manipulation check shows that bPC does not produce materially greater latent movement than standard PC at this scale (ratio 1.6, threshold 10). Third, removing CE alone, without changing inference dynamics, roughly halves the probe-softmax gap (Delta_MSE = -0.037 vs Delta_stdPC = -0.082), so CE is a major, empirically load-bearing assumption of the reduction at this scale. CE training produces output logit norms approximately 15x larger than MSE or bPC training. A post-hoc temperature-scaling ablation decomposes the probe-softmax gap into two components: approximately 66% is attributable to logit-scale effects removable by temperature rescaling, and approximately 34% reflects a scale-invariant ranking advantage of CE-trained representations. We use "metacognitive" operationally, to denote Type-2 discrimination of a readout over its own Type-1 correctness, not to imply human-like introspective access.
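The post-hoc temperature-scaling ablation presumably follows the standard recipe (Guo et al., 2017): fit a single scalar T on held-out logits by minimizing negative log-likelihood, then re-evaluate the gap; whatever survives is scale-invariant by construction, since dividing by T never changes the argmax. A hedged sketch under that assumption, with hypothetical toy logits (`logits_batch`, `labels`, and the search grid are ours, not the paper's):

```python
import math

def avg_nll(logits_batch, labels, T):
    """Mean negative log-likelihood of labels under softmax(logits / T)."""
    total = 0.0
    for z, y in zip(logits_batch, labels):
        zT = [v / T for v in z]
        m = max(zT)
        lse = m + math.log(sum(math.exp(v - m) for v in zT))
        total += lse - zT[y]
    return total / len(labels)

# Toy held-out logits with the large norms typical of CE training
# (values and labels are hypothetical, for illustration only).
logits_batch = [[9.0, 1.0, 0.0], [0.5, 8.0, 1.0], [2.0, 0.0, 7.5]]
labels = [0, 1, 0]  # the third example is confidently wrong: overconfidence

# Fit T by grid search over held-out NLL; T > 1 softens overconfident logits.
T_hat = min((t / 10 for t in range(1, 301)),
            key=lambda t: avg_nll(logits_batch, labels, t))

# Rescaling by T_hat changes confidence magnitudes but never the argmax,
# so any probe-softmax gap that survives this ablation is scale-invariant.
for z in logits_batch:
    zT = [v / T_hat for v in z]
    assert zT.index(max(zT)) == z.index(max(z))
```

On this toy batch the fitted temperature exceeds 1 because the held-out set contains a confidently wrong prediction; the fitted NLL is never worse than the T = 1 baseline, since T = 1 is in the grid.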