Graph spectral analysis (Fiedler value + Scheffer CSD indicators) predicts grokking 21k steps before loss function - five reproducible experiments [R]

Reddit r/MachineLearning / 5/19/2026


Key Points

  • The article proposes monitoring neural network weight-graph topology during training by combining the Fiedler value (second-smallest Laplacian eigenvalue) with Scheffer-style critical slowing down indicators.
  • Across five reproducible CPU experiments (under 24 hours), the method is reported to detect “grokking” about 21,000 steps before test accuracy visibly changes.
  • It claims grokking and catastrophic forgetting exhibit different spectral/structural signatures, enabling classification based on indicator behavior (reported slope differences per step).
  • The authors report that structurally guided interventions and compatibility-scored preemptive curricula can dramatically improve knowledge retention and accelerate grokking (e.g., strong retention rates and up to 48× acceleration across sequential tasks).
  • Experiments were limited to toy settings (modular arithmetic with 2-layer MLPs and a 1-layer transformer for sequence prediction), and scaling to production architectures remains unvalidated, with limitations discussed in the paper.

I've been applying the Fiedler value (second-smallest eigenvalue of the weight graph Laplacian) combined with Scheffer critical slowing down indicators to monitor neural network topology during training.
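For readers who want to see the quantity concretely, here is a minimal sketch of computing a Fiedler value from MLP weights. The graph construction is an assumption on my part, not necessarily what the repo does: nodes are neurons, edges connect neurons in adjacent layers, and the edge weight is the absolute value of the connecting weight. The function names `weight_graph_adjacency` and `fiedler_value` are hypothetical, and the Laplacian used is the unnormalized one, L = D - A.

```python
import numpy as np

def weight_graph_adjacency(weights):
    """Build a symmetric adjacency matrix for a feed-forward net.

    weights: list of arrays; weights[k] has shape (n_out, n_in) for layer k.
    Assumption: nodes are neurons and edge weight is |w_ij|.
    """
    sizes = [weights[0].shape[1]] + [w.shape[0] for w in weights]
    offsets = np.concatenate([[0], np.cumsum(sizes)])
    n = int(offsets[-1])
    adj = np.zeros((n, n))
    for k, w in enumerate(weights):
        cols = slice(offsets[k], offsets[k + 1])      # neurons feeding layer k
        rows = slice(offsets[k + 1], offsets[k + 2])  # neurons produced by layer k
        adj[rows, cols] = np.abs(w)
        adj[cols, rows] = np.abs(w).T                 # keep the graph undirected
    return adj

def fiedler_value(adj):
    """Second-smallest eigenvalue of the unnormalized Laplacian L = D - A."""
    degree = np.diag(adj.sum(axis=1))
    laplacian = degree - adj
    eigenvalues = np.linalg.eigvalsh(laplacian)  # returned in ascending order
    return eigenvalues[1]

# Toy usage: a 4-8-2 MLP with random weights.
rng = np.random.default_rng(0)
toy_weights = [rng.normal(size=(8, 4)), rng.normal(size=(2, 8))]
lambda_2 = fiedler_value(weight_graph_adjacency(toy_weights))
```

Logging `lambda_2` every N training steps gives the time series that the indicators are computed on. A dense eigendecomposition is fine at this scale; larger graphs would likely need a sparse solver.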

Five experiments, all reproducible on CPU in under 24 hours:

  1. Detection: lambda-2 (the Fiedler value) detects approaching grokking 21,000 steps before test accuracy moves
  2. Classification: grokking and catastrophic forgetting have distinct structural fingerprints (slope 0.00128 vs 0.00471/step)
  3. Steering: structurally-guided intervention preserves 91.7% of knowledge vs 2.6% unsteered
  4. Compounding: three sequential tasks, 100%/100%/97.5% retention, 48x grokking acceleration across tasks
  5. Preemptive curriculum: compatibility scoring ranks task disruption risk correctly; bridging preserves 100% vs 0% direct
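The per-step slopes in item 2 can be illustrated with an ordinary least-squares fit over a logged lambda-2 series. This is only a sketch: the fitting method and the window it is fitted over are assumptions here, and `slope_per_step` is a hypothetical helper name.

```python
import numpy as np

def slope_per_step(steps, values):
    """Least-squares slope of `values` against training `steps`."""
    slope, _intercept = np.polyfit(steps, values, deg=1)
    return slope

# Synthetic example: a series that rises by 0.00128 per step, logged every 10 steps.
steps = np.arange(0, 1000, 10)
values = 0.5 + 0.00128 * steps
estimated = slope_per_step(steps, values)
```

On real runs the series is noisy, so the estimate depends on the window length and on where the window starts.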

Tested on 2-layer MLPs (modular arithmetic) and a 1-layer transformer (sequence prediction). The paper has an honest limitations section: these are toy tasks, and scaling to production architectures is unvalidated.

The approach comes from complex systems science (Scheffer's early warning indicators for critical transitions) applied to weight graphs rather than ecosystems or financial markets.

Code and paper: https://github.com/EssexRich/neural_si_validation

Happy to discuss the maths, the experimental design, or the limitations.

submitted by /u/RichBenf