Dynamic batching for Encoder-Decoder MT training or generation when long sequence caps the batch size [P]

Reddit r/MachineLearning / 4/28/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • The author built a PyTorch batching sampler called dynabatch to address poor GPU utilization caused by long sequences forcing a small fixed batch size during encoder-decoder MT training.
  • The method sorts samples by token length, estimates the memory “pressure” of the hardest (longest) batch, then increases candidate batch sizes for shorter batches using a trained XGB regressor under a safety threshold.
  • The approach is targeted mainly at encoder-decoder models used for machine translation, since source length often correlates with target length, and the author cautions it is not ideal for decoder-only models.
  • In the author’s benchmarks, dynabatch improves training throughput by about 3.3x versus fixed batch sizing, while reported gains on a Colab T4 generation benchmark are much smaller (around 1.06x–1.21x).
  • Because the memory predictor is empirical and may be inaccurate for some models/tokenizers, the implementation includes a fallback that triggers when it overestimates and would otherwise cause OOM.

I built a small PyTorch sampler called dynabatch after running into this exact batching issue while fine-tuning an NLLB-200 600M model.

Training on an RTX 5090, the largest fixed batch size I could use was 8; anything bigger led to OOM. While training and monitoring with nvidia-smi, it looked like only a few batches were actually stressing the GPU; a lot of the time, utilization was much lower. My guess was that the fixed batch size was being dictated by the longest source/target examples, while the shorter examples probably had room for more samples per batch.
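To see why, here is a rough illustration with made-up token lengths (not from my dataset): padded activation memory scales roughly with batch_size × max_len in the batch, so a fixed batch size chosen to fit the longest examples leaves most of the budget unused on short ones.

```python
# made-up source lengths; a real dataset's distribution will differ
lengths = sorted([512, 498, 120, 95, 88, 60, 44, 31], reverse=True)
batch_size = 4  # fixed, sized so the longest batch just fits

for i in range(0, len(lengths), batch_size):
    batch = lengths[i:i + batch_size]
    padded_tokens = max(batch) * len(batch)  # crude memory proxy
    print(batch, "->", padded_tokens, "padded tokens")

# The short batch needs ~6x fewer padded tokens (352 vs 2048) than the
# longest one, i.e. there is room for roughly 6x more samples in it.
```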

So I tried to make the batch size change as the sequence lengths changed. The gist of the idea is:

  • sort examples by token length, longest first
  • treat the first batch as “this is the hardest batch that fits”
  • for later, shorter batches, try larger candidate batch sizes
  • use a small XGB regressor to predict memory pressure relative to that first batch
  • pick the largest candidate that stays under a safety threshold
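The steps above can be sketched like this (a minimal sketch, not the actual dynabatch code; `predict_pressure` is a hypothetical stand-in for the trained XGB regressor, and in practice the class would subclass `torch.utils.data.Sampler`):

```python
class LengthAwareBatchSampler:
    """Yields index batches, longest sequences first, growing the batch
    size for shorter batches while predicted memory pressure stays low."""

    def __init__(self, lengths, base_batch_size, predict_pressure,
                 max_batch_size=128, safety=0.9):
        # sort sample indices by token length, longest first
        self.order = sorted(range(len(lengths)), key=lambda i: -lengths[i])
        self.lengths = lengths
        self.base = base_batch_size
        self.predict = predict_pressure  # pressure relative to batch 0
        self.max_bs = max_batch_size
        self.safety = safety

    def __iter__(self):
        i = 0
        while i < len(self.order):
            max_len = self.lengths[self.order[i]]  # longest in this batch
            # grow the candidate batch size while the predictor says the
            # doubled batch stays under the safety threshold
            bs = self.base
            while bs * 2 <= self.max_bs and self.predict(bs * 2, max_len) < self.safety:
                bs *= 2
            yield self.order[i:i + bs]
            i += bs

# toy pressure model: padded token count relative to the hardest batch
hardest = 8 * 512
sampler = LengthAwareBatchSampler(
    lengths=[512] * 8 + [64] * 64,
    base_batch_size=8,
    predict_pressure=lambda bs, max_len: (bs * max_len) / hardest,
)
print([len(b) for b in sampler])  # → [8, 32, 32]
```

The long batch keeps the base size of 8, while the batches of short sequences are grown until the toy predictor says they would be as heavy as the hardest batch.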

This is mostly meant for encoder-decoder models, especially for MT, where source length is often a useful proxy for target length. I would not reach for this first with decoder-only models; sequence packing is a better fit there.

In my training benchmark, this gave about a 3.3x throughput improvement over fixed-batch training. That number is specific to my setup, so I do not think it should be read as a general claim. On a Colab T4 generation benchmark, the gain was only around 1.06x–1.21x.

The regressor is also empirical: it was trained on measured GPU memory usage, so it can be wrong sometimes and may behave a little differently for some models/tokenizers. But I have added a fallback for when it overestimates how much fits and would otherwise throw OOM. (I also added the regressor training notebooks for anyone interested.)
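A common shape for that kind of fallback looks like this (a sketch, not the repo's actual code; `MemoryError` stands in for `torch.cuda.OutOfMemoryError` so the snippet runs anywhere, and in PyTorch you would also call `torch.cuda.empty_cache()` before retrying):

```python
def run_with_oom_fallback(step_fn, batch, min_size=1):
    """Try the full batch; on OOM, split it in half and retry each part.

    `step_fn` is whatever runs the forward/backward pass. The exception
    to catch in PyTorch would be torch.cuda.OutOfMemoryError;
    MemoryError stands in for it here.
    """
    try:
        return [step_fn(batch)]
    except MemoryError:
        if len(batch) <= min_size:
            raise  # can't split further; a single sample doesn't fit
        mid = len(batch) // 2
        return (run_with_oom_fallback(step_fn, batch[:mid], min_size)
                + run_with_oom_fallback(step_fn, batch[mid:], min_size))

# toy demo: a "model" that OOMs for batches larger than 4 samples
def toy_step(batch):
    if len(batch) > 4:
        raise MemoryError("simulated OOM")
    return len(batch)

print(run_with_oom_fallback(toy_step, list(range(10))))  # → [2, 3, 2, 3]
```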

So, honestly I think this is a very niche tool especially in the decoder-only era, but I hope this helps for people who are training/generating using encoder-decoder MT models.

Repo: https://github.com/bendangnuksung/dynabatch
PyPI: https://pypi.org/project/dynabatch/

submitted by /u/Leather_Loan5314