Information Plane Analysis of Binary Neural Networks

arXiv cs.LG / 5/6/2026


Key Points

  • The paper applies information plane (IP) analysis to binary neural networks (BNNs), focusing on how mutual information (MI) can be estimated reliably despite high-dimensional deterministic representations.
  • It analyzes the finite-sample behavior of the plug-in entropy estimator and derives conditions on sample size (N) and representation dimensionality (D) under which MI estimates remain trustworthy.
  • Outside the reliable regime, empirical MI estimates saturate at \(\log_2 N\), making IP trajectories largely uninformative for interpreting training dynamics (a short derivation of this ceiling follows the list).
  • Using 375 trained BNNs, the study examines whether late-stage compression phases occur and how compressed representations relate to generalization.
  • The findings indicate that late-stage compression often appears, but compressed latent representations do not consistently improve generalization; the link is strongly dependent on task, architecture, and regularization.
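
To make the \(\log_2 N\) ceiling explicit, the counting argument behind it is short (the notation here is ours, for illustration; the paper's exact formulation may differ). For a deterministic network the representation \(T\) is a function of the input \(X\), so \(I(X;T) = H(T)\), and a plug-in estimate built from \(N\) samples can place probability mass on at most \(N\) distinct activation patterns:

\[
\hat{I}(X;T) \;=\; \hat{H}(T) \;=\; -\sum_{t \in \mathcal{T}_N} \hat{p}(t)\,\log_2 \hat{p}(t) \;\le\; \log_2 |\mathcal{T}_N| \;\le\; \log_2 N,
\]

where \(\mathcal{T}_N\) is the set of distinct patterns observed among the \(N\) samples. When \(2^D \gg N\), almost every sample produces a unique pattern, so \(|\mathcal{T}_N| \approx N\) and the estimate sits at the ceiling regardless of what training is doing.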

Abstract

Information plane (IP) analysis has been proposed as a way to study the training dynamics of deep neural networks through the mutual information (MI) between inputs, representations, and targets. However, its statistical validity is often compromised by the difficulty of estimating MI from samples of high-dimensional, deterministic representations. In this work, we perform IP analyses on binary neural networks (BNNs), whose activations are discrete and whose MI is finite. We characterise the finite-sample behaviour of the plug-in entropy estimator and identify regimes of sample size N and representation dimensionality D under which MI estimates are reliable. Outside these regimes, we show that empirical MI estimates saturate at \(\log_2 N\), rendering IP trajectories uninformative. Restricting attention to the reliable regime, we train 375 BNNs to investigate the existence of late-stage compression phases and the relationship between compressed representations and generalisation performance. Our results show that, while late-stage compression is frequently observed, compressed latent representations do not consistently correlate with improved generalisation performance. Instead, the relationship between compression and generalisation is highly dependent on task, architecture, and regularisation.
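
As a concrete illustration, below is a minimal Python sketch of a plug-in (maximum-likelihood) MI estimator for binary activations. The function names and the random stand-in data are our own illustrative choices, not code from the paper; the point is that once \(2^D\) far exceeds \(N\), nearly every sample maps to a unique activation pattern and the estimate pins to \(\log_2 N\).

```python
import numpy as np

def plugin_entropy(patterns: np.ndarray) -> float:
    """Plug-in (maximum-likelihood) entropy estimate, in bits, of the
    empirical distribution over the distinct rows of `patterns`."""
    _, counts = np.unique(patterns, axis=0, return_counts=True)
    p_hat = counts / counts.sum()
    return float(-np.sum(p_hat * np.log2(p_hat)))

def plugin_mi_input_repr(patterns: np.ndarray) -> float:
    """For a deterministic encoder, H(T | X) = 0, so the plug-in
    estimate of I(X; T) reduces to the plug-in entropy of T."""
    return plugin_entropy(patterns)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N = 1024  # sample size
    for D in (8, 16, 64):  # representation dimensionality
        # Random stand-in for the binarized activations of one hidden layer.
        T = rng.integers(0, 2, size=(N, D))
        print(f"D={D:3d}  I_hat = {plugin_mi_input_repr(T):6.2f} bits   "
              f"log2(N) = {np.log2(N):.2f} bits")
```

With D = 8, the 256 possible patterns are well covered by the 1,024 samples and the estimate is meaningful; with D = 16 or D = 64, almost every observed pattern is unique and the estimate collapses to \(\log_2 N = 10\) bits, exactly the saturation regime the paper flags as uninformative.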