[D] Make. Big. Batch. Size.

Reddit r/MachineLearning / 4/3/2026

💬 Opinion · Tools & Practical Usage · Models & Research

Key Points

  • A Reddit user reports training an RWKV v6 language model (~192.8M parameters) on an RTX 4050 and finding that increasing gradient accumulation (thus effective batch size) significantly improved perplexity (PPL).
  • They observed little or no improvement when using small effective batch sizes (e.g., batch_size=2 with gradient_accumulation=4; effective_batch=8), even after adjusting learning rate and time_decay-related LR.
  • After increasing gradient_accumulation substantially (e.g., to 32, then 64), PPL dropped much more dramatically, reaching about 20 PPL within a few hours at the larger effective batch — following more than four days of prior training that had plateaued.
  • The post frames this as practical training advice that may apply to training generative language models from scratch as well as fine-tuning, based on the author’s experiments.
  • The author presents the guidance as personal experience rather than a formal study, suggesting batch sizing/throughput tradeoffs can be decisive for convergence behavior.
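Gradient accumulation reaches a larger effective batch without more GPU memory: gradients from several micro-batches are summed (each scaled by 1/k) before one optimizer step, which is mathematically equivalent to a single gradient over the full effective batch. A minimal sketch, using a toy 1-D least-squares objective (all names here are illustrative, not from the post's training code):

```python
# Toy check: accumulating scaled micro-batch gradients over k steps
# matches one large-batch gradient (batch_size=2, accumulation=4 -> effective 8).

def grad_mse(w, xs, ys):
    # d/dw of mean((w*x - y)^2) over a (micro-)batch
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 16.2]
w = 0.5

# One gradient over all 8 samples (effective_batch=8)
g_large = grad_mse(w, xs, ys)

# Accumulation: 4 micro-batches of 2, each gradient scaled by 1/4
k = 4
g_accum = 0.0
for i in range(k):
    micro_x, micro_y = xs[2 * i:2 * i + 2], ys[2 * i:2 * i + 2]
    g_accum += grad_mse(w, micro_x, micro_y) / k

assert abs(g_large - g_accum) < 1e-9  # identical up to float rounding
```

In a PyTorch-style loop the same idea is `(loss / k).backward()` on each micro-batch, with `optimizer.step()` and `optimizer.zero_grad()` only every k-th iteration.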

It's something between a vent and a lesson learned.

I tried training an RWKV v6 model with my own code on my RTX 4050. I trained over 50k steps with batch_size=2 and gradient_accumulation=4 (effective_batch=2*4=8). It got down to 50 PPL (RWKV v6, ~192.8M model) and just wouldn't go lower. I changed the lr, the time_decay lr (RWKV's attention replacement), etc., but it either got worse or didn't change anything at all... And then... I just tried setting gradient_accumulation to 32. After one "epoch" (pseudo-epochs in my code, equal to 10k steps) it got to 40 PPL... Then I changed it to 64 and ran 3 epochs. My PPL dropped all the way to a freaking 20 PPL. I had trained this model for over 4 FULL DAYS non-stop, and only after doing all that stuff, after like 2-3 hours of training with effective_batch=64 (and 128), did I get a PPL drop THAT crazy..
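For scale: perplexity is just the exponential of the mean per-token cross-entropy loss (in nats), so the drop from 50 to 20 PPL corresponds to roughly a 0.9-nat reduction in loss. A minimal sketch of the conversion (assuming a nat-based loss, as in standard cross-entropy):

```python
import math

def ppl(mean_ce_loss_nats):
    """Perplexity from mean per-token cross-entropy in nats."""
    return math.exp(mean_ce_loss_nats)

def loss_from_ppl(perplexity):
    """Inverse: mean cross-entropy implied by a perplexity value."""
    return math.log(perplexity)

# Loss levels implied by the post's PPL numbers:
print(round(loss_from_ppl(50), 3))  # -> 3.912 nats at PPL 50
print(round(loss_from_ppl(20), 3))  # -> 2.996 nats at PPL 20
```

So the plateau at 50 PPL and the later drop to 20 PPL describe the same curve the training loss would show, just on an exponential scale.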

IDK if this post is low-effort, but it's still my advice for everyone who trains... at least generative LMs from scratch (and it's useful for fine-tuning too!)..

submitted by /u/Lines25