Trained a 125M LM from scratch instead of fine-tuning GPT-2 — releasing weights + SFT framework for others to build on

Reddit r/LocalLLaMA / 4/14/2026


Key Points

  • The author trained a 12-layer, ~125M parameter causal language model fully from scratch using a custom 16k BPE tokenizer and reports ~6.19 validation perplexity on WikiText-103 after ~92k steps.
  • They released two Hugging Face checkpoints: a continuation/base LM and an instruction/conversational variant fine-tuned with LoRA (rank 8) on DailyDialog using completion-only masked loss.
  • The release is positioned as an extensible “small-scale base model stack” aimed at letting others modify tokenization, instruction tuning, and domain adaptation without heavy multi-GPU infrastructure.
  • Alongside the weights, they published an SFT framework on GitHub so others can fine-tune their own variants without rebuilding the training pipeline.
  • The author plans to scale the same architecture to ~390M and is seeking advice on instruction datasets that perform well below ~500M parameters.

Trained a 125M LM from scratch (custom tokenizer) + released instruct checkpoint and SFT framework so others can fine-tune their own variants

I’ve been experimenting with training small language models fully from scratch (no GPT-2 init, no borrowed tokenizer) and wanted to share something others here might be able to build on.

I trained a 12-layer, ~125M-parameter causal LM with a custom 16k BPE tokenizer on WikiText-103 + TinyStories. Training ran for ~92k steps and reached ~6.19 validation perplexity on WikiText-103.
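The post doesn't include the tokenizer code, and a real 16k vocabulary would be trained with a library like Hugging Face `tokenizers` — but as a rough illustration of what "custom BPE tokenizer" means, here's a minimal merge loop in pure Python (toy corpus, tiny merge count; not the author's actual implementation):

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(pair, words):
    """Replace every occurrence of the pair with its concatenation."""
    merged, joined = " ".join(pair), "".join(pair)
    return {w.replace(merged, joined): f for w, f in words.items()}

def train_bpe(corpus, num_merges):
    """Learn `num_merges` BPE merges from a list of words."""
    # Start with each word split into characters, e.g. "low" -> "l o w"
    words = Counter(" ".join(w) for w in corpus)
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break
        best = max(counts, key=counts.get)  # most frequent adjacent pair
        words = merge_pair(best, words)
        merges.append(best)
    return merges

merges = train_bpe(["low", "low", "lower"], num_merges=2)
# first learned merges: ("l", "o"), then ("lo", "w")
```

A production tokenizer adds byte-level fallback, special tokens, and fast inference, but the learned-merge idea is the same.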

Then I trained a conversational variant using LoRA (rank 8) on DailyDialog (~87k examples) with completion-only masked loss and merged the adapter into a standalone checkpoint.
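For anyone unfamiliar with "completion-only masked loss": the prompt tokens are excluded from the loss so the model only learns to produce the response. In the PyTorch convention this means setting prompt labels to -100, which `cross_entropy` ignores. A minimal sketch of the label construction (the token IDs below are made up; the real pipeline works on tokenized DailyDialog turns):

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross_entropy skips

def build_labels(prompt_ids, response_ids):
    """Labels for completion-only SFT: mask the prompt, learn the response."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

# Hypothetical token IDs for a prompt like "User: hi\nAssistant:" + response
prompt = [12, 47, 9, 300]
response = [88, 15, 2]
input_ids, labels = build_labels(prompt, response)
# loss is computed only where labels != -100, i.e. on the response tokens
```

Libraries like TRL provide collators that do this automatically, but the underlying label trick is just this.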

Released both here:

Base model (continuation LM):

https://huggingface.co/MaheshwariSujal/librarian-base-130m

Instruct variant (dialogue tuned):

https://huggingface.co/MaheshwariSujal/Librarian-Instruct-130m

These obviously aren’t competing with modern 1B+ instruct models. The goal was to create a clean small-scale base model stack that people can actually modify.

I’m also releasing the SFT framework I used so anyone can fine-tune their own variants without rebuilding the pipeline:

https://github.com/sujal-maheshwari2004/Librarian-SFT

If someone wants a lightweight (~125M) base model for experimenting with instruction tuning, tokenizer changes, or domain adaptation without needing multi-GPU infra, this should be a reasonable starting point.
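If you want to replicate the LoRA step on your own data, note that merging the adapter into a standalone checkpoint (as done for the instruct variant above) is just W_merged = W + (alpha/r) · B @ A — with `peft` it's `merge_and_unload()`. A toy sketch of the arithmetic with tiny made-up matrices (the post used rank 8; rank 1 here for readability):

```python
def matmul(A, B):
    """Plain nested-list matrix multiply."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def merge_lora(W, A, B, alpha, r):
    """Fold a rank-r LoRA update into the base weight: W + (alpha/r) * B @ A."""
    delta = matmul(B, A)       # (out, in) low-rank update
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# Toy example: 2x2 base weight, rank-1 adapter
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]   # shape (out, r)
A = [[0.5, 0.5]]     # shape (r, in)
merged = merge_lora(W, A, B, alpha=2, r=1)
# merged == [[2.0, 1.0], [2.0, 3.0]]
```

After merging, the adapter matrices are gone and inference runs at the base model's cost, which is why the instruct checkpoint ships as a standalone model rather than base + adapter.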

Planning to scale the same architecture to ~390M next. If anyone has suggestions for strong instruction datasets that work well below ~500M params, I'd appreciate pointers.

submitted by /u/Kill_Streak308