235M param LLM from scratch on a single RTX 5080

Reddit r/LocalLLaMA / 4/22/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • A developer built a 235M-parameter transformer language model (“Plasma 1.0”) entirely from scratch in PyTorch, training all weights on raw text without using any pretrained checkpoints or Hugging Face downloads.
  • The model is designed with LLaMA-style architecture details including GQA (16 query heads / 4 KV heads), SwiGLU FFN, RoPE, RMSNorm pre-norm, and tied embeddings, using a 32k SentencePiece BPE vocabulary.
  • Training was performed on a single consumer RTX 5080 using bf16 mixed precision and gradient checkpointing, covering roughly 5B tokens at a sequence length of 1024.
  • The author implemented a full custom pretraining pipeline: data sourcing (FineWeb-Edu, Wikipedia, StackExchange, code, ArXiv), quality/toxicity filtering, MinHash deduplication, custom tokenizer training, domain-weighted mixing, and instruction tuning via loss masking.
  • Plasma 1.1 is currently training (500M parameters) with plans for improved multi-turn behavior and a larger vocabulary via byte fallback, and the repository is shared for questions and replication.

Hey everyone,

Been working on this for a while and figured I'd share it here too. I made a small transformer language model completely from scratch in PyTorch. No pretrained weights, no HuggingFace downloads. Every parameter was trained from raw text on a single consumer GPU.

Current release is Plasma 1.0 (235M params, 18 layers, hidden size 1024). LLaMA-style: GQA with 16 query heads and 4 KV heads (head_dim 64), SwiGLU FFN with 2816 intermediate dim, RoPE with theta 10000, RMSNorm pre-norm, tied embeddings. 32k SentencePiece BPE vocab. bf16 mixed precision with gradient checkpointing to fit on a 5080, trained ~5B tokens at seq len 1024.
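The hyperparameters above are enough to sanity-check the 235M figure with back-of-envelope arithmetic. A minimal sketch, assuming a 32,000-token vocab ("32k"), one weight vector per RMSNorm, and no biases; the exact checkpoint total may differ slightly:

```python
# Parameter count from the stated Plasma 1.0 config (assumed details noted above).
d_model = 1024
n_layers = 18
n_q_heads, n_kv_heads, head_dim = 16, 4, 64
ffn_dim = 2816
vocab = 32_000  # assumed exact value for the "32k" vocab

# Attention: Q and O cover all 16 query heads; K and V only the 4 KV heads (GQA).
attn = d_model * (n_q_heads * head_dim)        # W_q
attn += 2 * d_model * (n_kv_heads * head_dim)  # W_k, W_v
attn += (n_q_heads * head_dim) * d_model       # W_o

# SwiGLU FFN: gate, up, and down projections, all d_model x ffn_dim.
ffn = 3 * d_model * ffn_dim

# Two RMSNorm weight vectors per layer (pre-attention, pre-FFN).
norms_per_layer = 2 * d_model

per_layer = attn + ffn + norms_per_layer
# Tied embeddings are counted once; plus a final RMSNorm.
total = n_layers * per_layer + vocab * d_model + d_model

print(f"{total:,}")  # 235,705,344 ≈ 235.7M, matching the stated 235M
```

Note how GQA pays off here: shrinking K/V from 16 heads to 4 saves about 1.5M parameters per layer versus full multi-head attention, and cuts the KV cache by 4x at inference.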

I also wrote the full pipeline myself:

  • Data from FineWeb-Edu, Wikipedia, StackExchange, code, and ArXiv
  • Quality and toxicity filtering
  • MinHash deduplication
  • Custom SentencePiece tokenizer
  • Domain-weighted data mixing
  • Pretraining and instruction tuning with loss masking so it only learns from assistant tokens
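The loss-masking step in the last bullet can be sketched with the common convention of setting labels to -100 (PyTorch's `cross_entropy` ignore_index) everywhere outside assistant responses; the function name, token ids, and span format below are made up for illustration, not taken from the repo:

```python
# Minimal loss-masking sketch: only assistant tokens keep their label,
# so only they contribute to the cross-entropy loss during instruct tuning.
IGNORE = -100  # PyTorch's default ignore_index for cross_entropy

def mask_labels(token_ids, assistant_spans):
    """Copy token_ids into labels, hiding everything outside assistant spans.

    assistant_spans: list of (start, end) index pairs (end exclusive)
    covering the assistant's tokens; all other positions are ignored.
    """
    labels = [IGNORE] * len(token_ids)
    for start, end in assistant_spans:
        labels[start:end] = token_ids[start:end]
    return labels

# Example: a 10-token chat where positions 6..9 are the assistant's reply.
tokens = [101, 5, 6, 7, 8, 102, 42, 43, 44, 102]
labels = mask_labels(tokens, [(6, 10)])
print(labels)  # [-100, -100, -100, -100, -100, -100, 42, 43, 44, 102]
```

In an actual PyTorch training loop these labels would be passed to `F.cross_entropy(logits.view(-1, vocab), labels.view(-1), ignore_index=-100)`, so the user and system tokens still provide context but generate no gradient.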

Some sample outputs after instruct tuning:

You: When was World War 1?
1386.ai: World War I began on June 26, 1914.

You: What is a steak made of?
1386.ai: A steak can be made from various types of meat, including beef.

It's obviously not competing with Llama 3. There are hallucinations, odd outputs, and a pretty hard ceiling at this scale. But doing it this way taught me way more than just fine-tuning a larger model would have.

Plasma 1.1 is currently training (500M params), aiming for better multi-turn and a larger vocab with byte fallback.

Repo: github.com/eb1386/1386.ai

Happy to answer any questions about the pipeline or architecture choices.

submitted by /u/ExcellentTip9926