Trained a 125M LM from scratch instead of fine-tuning GPT-2 — releasing weights + SFT framework for others to build on

Reddit r/LocalLLaMA / 4/14/2026


Key Points

  • The author trained a 12-layer, ~125M parameter causal language model fully from scratch using a custom 16k BPE tokenizer and reports ~6.19 validation perplexity on WikiText-103 after ~92k steps.
  • They released two Hugging Face checkpoints: a continuation/base LM and an instruction/conversational variant fine-tuned with LoRA (rank 8) on DailyDialog using completion-only masked loss.
  • The release is positioned as an extensible “small-scale base model stack” aimed at letting others modify tokenization, instruction tuning, and domain adaptation without heavy multi-GPU infrastructure.
  • Alongside the weights, they published an SFT framework on GitHub so others can fine-tune their own variants without rebuilding the training pipeline.
  • The author plans to scale the same architecture to ~390M and is seeking advice on instruction datasets that perform well below ~500M parameters.

Trained a 125M LM from scratch (custom tokenizer) + released instruct checkpoint and SFT framework so others can fine-tune their own variants

I’ve been experimenting with training small language models fully from scratch (no GPT-2 init, no borrowed tokenizer) and wanted to share something others here might be able to build on.

I trained a 12-layer, ~125M-parameter causal LM with a custom 16k BPE tokenizer on WikiText-103 + TinyStories. Training ran for ~92k steps and reached ~6.19 validation perplexity on WikiText-103.
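The post doesn't include the tokenizer code, and a real 16k vocabulary would be trained with a library like Hugging Face `tokenizers` — but as a rough illustration of what "custom BPE tokenizer" means, here's a minimal merge loop in pure Python (toy corpus, tiny merge count; not the author's actual implementation):

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(pair, words):
    """Replace every occurrence of the pair with its concatenation."""
    merged, joined = " ".join(pair), "".join(pair)
    return {w.replace(merged, joined): f for w, f in words.items()}

def train_bpe(corpus, num_merges):
    """Learn `num_merges` BPE merges from a list of words."""
    # Start with each word split into characters, e.g. "low" -> "l o w"
    words = Counter(" ".join(w) for w in corpus)
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break
        best = max(counts, key=counts.get)  # most frequent adjacent pair
        words = merge_pair(best, words)
        merges.append(best)
    return merges

merges = train_bpe(["low", "low", "lower"], num_merges=2)
# first learned merges: ("l", "o"), then ("lo", "w")
```

A production tokenizer adds byte-level fallback, special tokens, and fast inference, but the learned-merge idea is the same.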

Then I trained a conversational variant using LoRA (rank 8) on DailyDialog (~87k examples) with completion-only masked loss and merged the adapter into a standalone checkpoint.
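For anyone unfamiliar with "completion-only masked loss": the prompt tokens are excluded from the loss so the model only learns to produce the response. In the PyTorch convention this means setting prompt labels to -100, which `cross_entropy` ignores. A minimal sketch of the label construction (the token IDs below are made up; the real pipeline works on tokenized DailyDialog turns):

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross_entropy skips

def build_labels(prompt_ids, response_ids):
    """Labels for completion-only SFT: mask the prompt, learn the response."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

# Hypothetical token IDs for a prompt like "User: hi\nAssistant:" + response
prompt = [12, 47, 9, 300]
response = [88, 15, 2]
input_ids, labels = build_labels(prompt, response)
# loss is computed only where labels != -100, i.e. on the response tokens
```

Libraries like TRL provide collators that do this automatically, but the underlying label trick is just this.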

Released both here:

Base model (continuation LM):

https://huggingface.co/MaheshwariSujal/librarian-base-130m

Instruct variant (dialogue tuned):

https://huggingface.co/MaheshwariSujal/Librarian-Instruct-130m

These obviously aren’t competing with modern 1B+ instruct models. The goal was to create a clean small-scale base model stack that people can actually modify.

I’m also releasing the SFT framework I used so anyone can fine-tune their own variants without rebuilding the pipeline:

https://github.com/sujal-maheshwari2004/Librarian-SFT

If someone wants a lightweight (~125M) base model for experimenting with instruction tuning, tokenizer changes, or domain adaptation without needing multi-GPU infra, this should be a reasonable starting point.
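If you want to replicate the LoRA step on your own data, note that merging the adapter into a standalone checkpoint (as done for the instruct variant above) is just W_merged = W + (alpha/r) · B @ A — with `peft` it's `merge_and_unload()`. A toy sketch of the arithmetic with tiny made-up matrices (the post used rank 8; rank 1 here for readability):

```python
def matmul(A, B):
    """Plain nested-list matrix multiply."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def merge_lora(W, A, B, alpha, r):
    """Fold a rank-r LoRA update into the base weight: W + (alpha/r) * B @ A."""
    delta = matmul(B, A)       # (out, in) low-rank update
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# Toy example: 2x2 base weight, rank-1 adapter
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]   # shape (out, r)
A = [[0.5, 0.5]]     # shape (r, in)
merged = merge_lora(W, A, B, alpha=2, r=1)
# merged == [[2.0, 1.0], [2.0, 3.0]]
```

After merging, the adapter matrices are gone and inference runs at the base model's cost, which is why the instruct checkpoint ships as a standalone model rather than base + adapter.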

Planning to scale the same architecture to ~390M next. If anyone has suggestions for strong instruction datasets that work well below ~500M params, I'd appreciate pointers.

submitted by /u/Kill_Streak308