[D] Make. Big. Batch. Size.

Reddit r/MachineLearning / 4/3/2026

💬 Opinion · Tools & Practical Usage · Models & Research

Key Points

  • A Reddit user reports training an RWKV v6 language model (~192.8M parameters) on an RTX 4050 and finding that increasing gradient accumulation (thus effective batch size) significantly improved perplexity (PPL).
  • They observed little or no improvement when using small effective batch sizes (e.g., batch_size=2 with gradient_accumulation=4; effective_batch=8), even after adjusting learning rate and time_decay-related LR.
  • After increasing gradient_accumulation substantially (e.g., to 32, then 64), PPL dropped much more dramatically, reaching about 20 PPL within a few hours at the larger effective batch — following more than four days of prior training that had plateaued.
  • The post frames this as practical training advice that may apply to training generative language models from scratch as well as fine-tuning, based on the author’s experiments.
  • The author presents the guidance as personal experience rather than a formal study, suggesting batch sizing/throughput tradeoffs can be decisive for convergence behavior.
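Gradient accumulation reaches a larger effective batch without more GPU memory: gradients from several micro-batches are summed (each scaled by 1/k) before one optimizer step, which is mathematically equivalent to a single gradient over the full effective batch. A minimal sketch, using a toy 1-D least-squares objective (all names here are illustrative, not from the post's training code):

```python
# Toy check: accumulating scaled micro-batch gradients over k steps
# matches one large-batch gradient (batch_size=2, accumulation=4 -> effective 8).

def grad_mse(w, xs, ys):
    # d/dw of mean((w*x - y)^2) over a (micro-)batch
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 16.2]
w = 0.5

# One gradient over all 8 samples (effective_batch=8)
g_large = grad_mse(w, xs, ys)

# Accumulation: 4 micro-batches of 2, each gradient scaled by 1/4
k = 4
g_accum = 0.0
for i in range(k):
    micro_x, micro_y = xs[2 * i:2 * i + 2], ys[2 * i:2 * i + 2]
    g_accum += grad_mse(w, micro_x, micro_y) / k

assert abs(g_large - g_accum) < 1e-9  # identical up to float rounding
```

In a PyTorch-style loop the same idea is `(loss / k).backward()` on each micro-batch, with `optimizer.step()` and `optimizer.zero_grad()` only every k-th iteration.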

It's something between a vent and a lesson learned.

I tried training an RWKV v6 model with my own code on my RTX 4050. I trained over 50k steps with batch_size=2 and gradient_accumulation=4 (effective_batch=2*4=8). It got down to 50 PPL (RWKV v6, ~192.8M model) and just wouldn't go lower. I changed the lr, the time_decay lr (RWKV's attention replacement), etc., but it either got worse or didn't change anything at all... And then... I just tried setting gradient_accumulation to 32. After one "epoch" (pseudo-epochs in my code, equal to 10k steps) it got to 40 PPL... Then I changed it to 64 and ran 3 epochs. My PPL dropped all the way to a freaking 20 PPL. I had trained this model for over 4 FULL DAYS non-stop, and only after doing all that stuff, after like 2-3 hours of training with effective_batch=64 (and 128), did I get a PPL drop THAT crazy..
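For scale: perplexity is just the exponential of the mean per-token cross-entropy loss (in nats), so the drop from 50 to 20 PPL corresponds to roughly a 0.9-nat reduction in loss. A minimal sketch of the conversion (assuming a nat-based loss, as in standard cross-entropy):

```python
import math

def ppl(mean_ce_loss_nats):
    """Perplexity from mean per-token cross-entropy in nats."""
    return math.exp(mean_ce_loss_nats)

def loss_from_ppl(perplexity):
    """Inverse: mean cross-entropy implied by a perplexity value."""
    return math.log(perplexity)

# Loss levels implied by the post's PPL numbers:
print(round(loss_from_ppl(50), 3))  # -> 3.912 nats at PPL 50
print(round(loss_from_ppl(20), 3))  # -> 2.996 nats at PPL 20
```

So the plateau at 50 PPL and the later drop to 20 PPL describe the same curve the training loss would show, just on an exponential scale.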

IDK if this post is low-effort, but it's still my advice for everyone who trains... at least generative LMs from scratch (and it's useful for fine-tuning too!)..

submitted by /u/Lines25