235M param LLM from scratch on a single RTX 5080

Reddit r/LocalLLaMA / 4/22/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • A developer built a 235M-parameter transformer language model (“Plasma 1.0”) entirely from scratch in PyTorch, training all weights on raw text without using any pretrained checkpoints or Hugging Face downloads.
  • The model is designed with LLaMA-style architecture details including GQA (16 query heads / 4 KV heads), SwiGLU FFN, RoPE, RMSNorm pre-norm, and tied embeddings, using a 32k SentencePiece BPE vocabulary.
  • Training was performed on a single consumer RTX 5080 using bf16 mixed precision and gradient checkpointing, covering roughly 5B tokens at a sequence length of 1024.
  • The author implemented a full custom pretraining pipeline: data sourcing (FineWeb-Edu, Wikipedia, StackExchange, code, ArXiv), quality/toxicity filtering, MinHash deduplication, custom tokenizer training, domain-weighted mixing, and instruction tuning via loss masking.
  • Plasma 1.1 is currently training (500M parameters) with plans for improved multi-turn behavior and a larger vocabulary via byte fallback, and the repository is shared for questions and replication.

Hey everyone,

Been working on this for a while and figured I'd share it here too. I made a small transformer language model completely from scratch in PyTorch. No pretrained weights, no HuggingFace downloads. Every parameter was trained from raw text on a single consumer GPU.

Current release is Plasma 1.0 (235M params, 18 layers, hidden size 1024). LLaMA-style: GQA with 16 query heads and 4 KV heads (head_dim 64), SwiGLU FFN with 2816 intermediate dim, RoPE with theta 10000, RMSNorm pre-norm, tied embeddings. 32k SentencePiece BPE vocab. bf16 mixed precision with gradient checkpointing to fit on a 5080, trained ~5B tokens at seq len 1024.
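The hyperparameters above are enough to sanity-check the 235M figure with back-of-envelope arithmetic. A minimal sketch, assuming a 32,000-token vocab ("32k"), one weight vector per RMSNorm, and no biases; the exact checkpoint total may differ slightly:

```python
# Parameter count from the stated Plasma 1.0 config (assumed details noted above).
d_model = 1024
n_layers = 18
n_q_heads, n_kv_heads, head_dim = 16, 4, 64
ffn_dim = 2816
vocab = 32_000  # assumed exact value for the "32k" vocab

# Attention: Q and O cover all 16 query heads; K and V only the 4 KV heads (GQA).
attn = d_model * (n_q_heads * head_dim)        # W_q
attn += 2 * d_model * (n_kv_heads * head_dim)  # W_k, W_v
attn += (n_q_heads * head_dim) * d_model       # W_o

# SwiGLU FFN: gate, up, and down projections, all d_model x ffn_dim.
ffn = 3 * d_model * ffn_dim

# Two RMSNorm weight vectors per layer (pre-attention, pre-FFN).
norms_per_layer = 2 * d_model

per_layer = attn + ffn + norms_per_layer
# Tied embeddings are counted once; plus a final RMSNorm.
total = n_layers * per_layer + vocab * d_model + d_model

print(f"{total:,}")  # 235,705,344 ≈ 235.7M, matching the stated 235M
```

Note how GQA pays off here: shrinking K/V from 16 heads to 4 saves about 1.5M parameters per layer versus full multi-head attention, and cuts the KV cache by 4x at inference.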

I also wrote the full pipeline myself:

  • Data from FineWeb-Edu, Wikipedia, StackExchange, code, and ArXiv
  • Quality and toxicity filtering
  • MinHash deduplication
  • Custom SentencePiece tokenizer
  • Domain-weighted data mixing
  • Pretraining and instruction tuning with loss masking so it only learns from assistant tokens
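The loss-masking step in the last bullet can be sketched with the common convention of setting labels to -100 (PyTorch's `cross_entropy` ignore_index) everywhere outside assistant responses; the function name, token ids, and span format below are made up for illustration, not taken from the repo:

```python
# Minimal loss-masking sketch: only assistant tokens keep their label,
# so only they contribute to the cross-entropy loss during instruct tuning.
IGNORE = -100  # PyTorch's default ignore_index for cross_entropy

def mask_labels(token_ids, assistant_spans):
    """Copy token_ids into labels, hiding everything outside assistant spans.

    assistant_spans: list of (start, end) index pairs (end exclusive)
    covering the assistant's tokens; all other positions are ignored.
    """
    labels = [IGNORE] * len(token_ids)
    for start, end in assistant_spans:
        labels[start:end] = token_ids[start:end]
    return labels

# Example: a 10-token chat where positions 6..9 are the assistant's reply.
tokens = [101, 5, 6, 7, 8, 102, 42, 43, 44, 102]
labels = mask_labels(tokens, [(6, 10)])
print(labels)  # [-100, -100, -100, -100, -100, -100, 42, 43, 44, 102]
```

In an actual PyTorch training loop these labels would be passed to `F.cross_entropy(logits.view(-1, vocab), labels.view(-1), ignore_index=-100)`, so the user and system tokens still provide context but generate no gradient.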

Some sample outputs after instruct tuning:

You: When was World War 1?
1386.ai: World War I began on June 26, 1914.

You: What is a steak made of?
1386.ai: A steak can be made from various types of meat, including beef.

It's obviously not competing with Llama 3. There are hallucinations, odd outputs, and a pretty hard ceiling at this scale. But doing it this way taught me way more than just fine-tuning a larger model would have.

Plasma 1.1 is currently training (500M params), aiming for better multi-turn and a larger vocab with byte fallback.

Repo: github.com/eb1386/1386.ai

Happy to answer any questions about the pipeline or architecture choices.

submitted by /u/ExcellentTip9926