Hey everyone,
Been working on this for a while and figured I'd share it here too. I made a small transformer language model completely from scratch in PyTorch. No pretrained weights, no HuggingFace downloads. Every parameter was trained from raw text on a single consumer GPU.
Current release is Plasma 1.0 (235M params, 18 layers, hidden size 1024). LLaMA-style: GQA with 16 query heads and 4 KV heads (head_dim 64), SwiGLU FFN with 2816 intermediate dim, RoPE with theta 10000, RMSNorm pre-norm, tied embeddings. 32k SentencePiece BPE vocab. bf16 mixed precision with gradient checkpointing to fit on an RTX 5080, trained on ~5B tokens at seq len 1024.
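For anyone curious how those GQA numbers fit together, here's a minimal PyTorch sketch of the attention shapes (16 query heads sharing 4 KV heads at head_dim 64 → hidden 1024). This is just an illustration of the layout, not the actual code from the repo:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GQAAttention(nn.Module):
    # Hypothetical sketch matching the post's shapes, not the repo's implementation.
    def __init__(self, hidden=1024, n_q=16, n_kv=4, head_dim=64):
        super().__init__()
        self.n_q, self.n_kv, self.head_dim = n_q, n_kv, head_dim
        self.q_proj = nn.Linear(hidden, n_q * head_dim, bias=False)   # 1024 -> 1024
        self.k_proj = nn.Linear(hidden, n_kv * head_dim, bias=False)  # 1024 -> 256
        self.v_proj = nn.Linear(hidden, n_kv * head_dim, bias=False)  # 1024 -> 256
        self.o_proj = nn.Linear(n_q * head_dim, hidden, bias=False)

    def forward(self, x):
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.n_q, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.n_kv, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.n_kv, self.head_dim).transpose(1, 2)
        # Each group of 16 // 4 = 4 query heads shares one KV head,
        # so the KV cache is 4x smaller than full multi-head attention.
        k = k.repeat_interleave(self.n_q // self.n_kv, dim=1)
        v = v.repeat_interleave(self.n_q // self.n_kv, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(B, T, -1))
```

The KV projections are the whole point: K and V only project to 256 dims instead of 1024, which cuts the KV cache by 4x at inference time.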
I also wrote the full pipeline myself:
- Data from FineWeb-Edu, Wikipedia, StackExchange, code, and ArXiv
- Quality and toxicity filtering
- MinHash deduplication
- Custom SentencePiece tokenizer
- Domain-weighted data mixing
- Pretraining and instruction tuning with loss masking so it only learns from assistant tokens
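The loss-masking step in that last bullet is worth spelling out. The usual trick (and what I'd assume is happening here, the repo may differ) is to copy the input ids into a label tensor and set every non-assistant position to the ignore index, so cross-entropy never backprops through prompt tokens:

```python
import torch
import torch.nn.functional as F

def mask_labels(input_ids, assistant_mask, ignore_index=-100):
    # Blank out everything that is not an assistant token so the
    # loss only sees assistant spans. -100 is PyTorch's default ignore_index.
    labels = input_ids.clone()
    labels[~assistant_mask] = ignore_index
    return labels

# Toy example: 8-token sequence where only the last 3 tokens
# (the assistant reply) contribute to the loss.
input_ids = torch.tensor([[5, 9, 2, 7, 3, 11, 4, 6]])
assistant_mask = torch.tensor([[0, 0, 0, 0, 0, 1, 1, 1]], dtype=torch.bool)
labels = mask_labels(input_ids, assistant_mask)

logits = torch.randn(1, 8, 32000)  # (batch, seq, vocab) -- dummy model output
# Shift by one for next-token prediction, as in standard causal LM training.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, 32000),
    labels[:, 1:].reshape(-1),
    ignore_index=-100,
)
```

Without this, the model spends capacity learning to regenerate user prompts, which is wasted at this parameter budget.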
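And for the MinHash dedup step, the core idea fits in a few lines: shingle each document into word n-grams, take one min-hash per seed, and compare signatures to estimate Jaccard similarity. A pure-Python sketch of the idea (the repo presumably uses a proper LSH implementation, and 5-grams / 64 hashes here are my assumptions, not its settings):

```python
import hashlib

def minhash_signature(text, num_hashes=64, ngram=5):
    # Shingle into word 5-grams, then keep the minimum hash per seed.
    words = text.lower().split()
    shingles = {" ".join(words[i:i + ngram])
                for i in range(max(1, len(words) - ngram + 1))}
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles)
        for seed in range(num_hashes)
    ]

def est_jaccard(sig_a, sig_b):
    # Fraction of matching min-hashes approximates Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

In practice you'd bucket signatures with LSH so you never do pairwise comparisons across the whole corpus, but the estimator above is the primitive underneath.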
Some sample outputs after instruct tuning:
You: When was World War 1?
1386.ai: World War I began on June 26, 1914.

You: What is a steak made of?
1386.ai: A steak can be made from various types of meat, including beef.
It's obviously not competing with Llama 3. There are hallucinations, odd outputs, and a pretty hard ceiling at this scale. But doing it this way taught me way more than just fine-tuning a larger model would have.
Plasma 1.1 is currently training (500M params), aiming for better multi-turn and a larger vocab with byte fallback.
Repo: github.com/eb1386/1386.ai
Happy to answer any questions about the pipeline or architecture choices.
