Here's how my LLM's decoder block changed while training on 5B tokens

Reddit r/LocalLLaMA / 4/12/2026

Key Points

  • The author describes an experimental LLM training run in which they replaced a transformer’s MLP decoder blocks with a discrete, lower-dimensional spline manifold geometry approach from their K‑Splanifolds paper.
  • They report monitoring the model across training on 5B tokens and show how layer 96 out of 128 evolves visually during that process.
  • They state that the resulting ~18M-parameter model performs surprisingly well and that training loss continues to decrease.
  • The author plans to keep training until there are signs of loss stagnation, using this as an informal validation of the modified decoder design.
I'm monitoring an experimental model's ongoing training run. I replaced the MLP decoder blocks of a traditional transformer with the discrete, lower-dimensional spline-manifold geometry described in my K-Splanifolds paper. The image shows how layer 96 of 128 developed over the 5B tokens trained so far. The 18M-parameter model works surprisingly well and loss is still decreasing, so I'll keep training until I see evidence of stagnation. Just thought you all might find this look at its development interesting.
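The post doesn't spell out the mechanism (that's in the K-Splanifolds paper), but the general idea of swapping an MLP block for a low-dimensional spline lookup can be loosely sketched. Everything below is an assumption for illustration: the piecewise-linear splines, the learned down/up projections, the sigmoid squashing, and the residual connection are all hypothetical choices, not the author's actual formulation.

```python
import numpy as np

def piecewise_linear_spline(x, knots, values):
    """Evaluate a 1D piecewise-linear spline with learned control points.

    x:      (...,) query positions, clipped into the knot range
    knots:  (K,) sorted knot positions in [0, 1]
    values: (K,) learned value at each knot
    """
    x = np.clip(x, knots[0], knots[-1])
    # Index of the left knot of the segment containing each query point.
    idx = np.clip(np.searchsorted(knots, x, side="right") - 1, 0, len(knots) - 2)
    x0, x1 = knots[idx], knots[idx + 1]
    t = (x - x0) / (x1 - x0)
    return (1 - t) * values[idx] + t * values[idx + 1]

class SplineBlock:
    """Toy stand-in for an MLP decoder block: project hidden states down to a
    few manifold coordinates, warp each coordinate through a learned spline,
    then project back up. A sketch only, not the K-Splanifolds method."""

    def __init__(self, d_model=64, d_manifold=4, n_knots=16, seed=0):
        rng = np.random.default_rng(seed)
        self.down = rng.normal(scale=d_model ** -0.5, size=(d_model, d_manifold))
        self.up = rng.normal(scale=d_manifold ** -0.5, size=(d_manifold, d_model))
        self.knots = np.linspace(0.0, 1.0, n_knots)
        self.values = rng.normal(scale=0.1, size=(d_manifold, n_knots))

    def __call__(self, h):
        # h: (seq, d_model) hidden states
        coords = 1.0 / (1.0 + np.exp(-h @ self.down))  # squash into [0, 1]
        warped = np.stack(
            [piecewise_linear_spline(coords[:, j], self.knots, self.values[j])
             for j in range(coords.shape[1])], axis=-1)
        return h + warped @ self.up                    # residual connection

block = SplineBlock()
out = block(np.zeros((3, 64)))
print(out.shape)  # (3, 64)
```

The parameter savings come from the bottleneck: the spline tables scale with `d_manifold * n_knots` rather than with a `4 * d_model` MLP expansion, which is consistent with the post's unusually small ~18M total parameter count.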

submitted by /u/1ncehost