I spent time getting autoresearch running properly on an RTX 5090 / Blackwell setup and thought it might save other people some time to share what actually happened.
The short version
The initial path was badly broken. We saw extremely poor performance at first — on the order of a few thousand tok/sec and essentially useless MFU — despite the code technically “running.”
The eventual working path was:
• avoid the broken full-model compile path on this setup
• keep the good fused optimizer compile improvements where they actually helped
• use the stable SDPA / CuDNN attention path
• tune total batch and time budget empirically instead of guessing
• automate the benchmark / extract / strategize / rerun loop
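The last bullet is worth spelling out: the loop was benchmark → extract metrics → strategize the next config → rerun, with lock cleanup guaranteed even when a run died. A minimal, hypothetical skeleton of that control flow (every function name here is a placeholder I made up, not the actual tool's API):

```python
import json
from pathlib import Path

def run_loop(benchmark, extract, strategize, config, lock_path, max_iters=10):
    """benchmark/extract/strategize are injected callables standing in for
    the real tooling. Two properties matter: the lock file is always removed
    (even on failure), and the completion hook (extract) only runs after
    benchmark has actually finished -- getting that dispatch order wrong was
    one of the automation bugs described below."""
    history = []
    for _ in range(max_iters):
        lock = Path(lock_path)
        lock.write_text(json.dumps(config))   # mark a run as in flight
        try:
            raw = benchmark(config)           # launch and wait for the run
            metrics = extract(raw)            # completion hook: parse results
            history.append((config, metrics))
        finally:
            lock.unlink(missing_ok=True)      # cleanup even if the run crashed
        config = strategize(history)          # pick the next config to try
        if config is None:                    # strategizer says stop
            break
    return history
```

The try/finally around the lock is the shape of the fix for stale-lock bugs; in the real system, benchmark launched the training run and extract parsed val_bpb and MFU out of the logs.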
What failed
A few failure modes were especially misleading:
• a path that was technically correct but catastrophically slow
• MFU numbers that were misleading until the denominator (the card's actual peak FLOPS) was corrected for the 5090
• higher per-device batch settings that looked like they should help but actually made things much worse
• automation bugs around lock cleanup / completion hooks / dispatch order
In other words: there were several ways to get a run that looked alive while doing something stupid.
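The MFU trap deserves a formula. MFU is just achieved FLOPS over peak FLOPS, and the peak term is where things went wrong: plug in a datacenter-class or sparsity-inflated spec number for the 5090 and a healthy run looks broken (or a broken run looks fine). A minimal sketch with entirely hypothetical model numbers; the ~210 TFLOPS dense BF16 figure for the 5090 is my assumption, so verify it against the actual spec sheet:

```python
def mfu(tokens_per_sec: float, n_params: float, peak_flops: float) -> float:
    """Model FLOPs Utilization: achieved FLOPs/sec over peak FLOPs/sec.

    Uses the standard ~6 * N FLOPs-per-token estimate for a dense
    transformer forward + backward pass (N = parameter count).
    """
    achieved = tokens_per_sec * 6 * n_params
    return achieved / peak_flops

# Hypothetical numbers for illustration: a ~500M-param model at
# 30k tok/s, against an assumed ~210 TFLOPS dense BF16 peak.
print(f"{mfu(30_000, 500e6, 210e12):.2%}")  # ~42.86%
```

Same throughput, wrong peak in the denominator, and the reported utilization can be off by a large constant factor, which is exactly the kind of "looks alive while doing something stupid" signal described above.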
What helped
Real improvements came from:
• re-enabling the fused optimizer compile path
• reducing total batch from the original larger setting
• validating 2**17 as the better total batch region
• increasing time budget once the stable batch regime was found
• treating automation as part of the benchmark system, not an afterthought
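On the batch-size bullets: TOTAL_BATCH_SIZE here is tokens per optimizer step, so moving from 2**18 to 2**17 halves the gradient-accumulation count rather than the per-device batch. A sketch of the arithmetic, where the per-device batch, sequence length, and world size are hypothetical values, not our actual settings:

```python
def grad_accum_steps(total_batch_tokens, device_batch, seq_len, world_size):
    """Gradient-accumulation steps needed so that:
    total_batch_tokens = device_batch * seq_len * world_size * accum_steps
    """
    tokens_per_micro_step = device_batch * seq_len * world_size
    assert total_batch_tokens % tokens_per_micro_step == 0, "must divide evenly"
    return total_batch_tokens // tokens_per_micro_step

# Hypothetical single-5090 setup: per-device batch 32, sequence length 2048.
print(grad_accum_steps(2**17, 32, 2048, 1))  # 131072 / 65536 = 2
print(grad_accum_steps(2**18, 32, 2048, 1))  # 4
```

Keeping per-device batch fixed and varying only the accumulation count is what made the 2**17 vs 2**18 comparison a clean one-variable experiment.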
Progression
A simplified progression of the useful runs:
• baseline healthy run: val_bpb 1.165452, MFU 40.49%
• fused optimizer compile improvement: val_bpb 1.155400, MFU 42.88%
• TOTAL_BATCH_SIZE = 2**18: val_bpb 1.108381, MFU 43.18%
• TOTAL_BATCH_SIZE = 2**17 validation: val_bpb 1.089424, MFU 43.03%
• best current auto-loop result (TOTAL_BATCH_SIZE = 2**17, TIME_BUDGET = 1200, LR multiplier = 1.0): val_bpb 0.999445, MFU 42.56%, 387.8M total tokens, 2959 steps
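Those last numbers are internally consistent, which is a cheap sanity check worth building into any automation loop: steps times tokens-per-step should reproduce the total token count, and total tokens over the time budget gives real throughput.

```python
TOTAL_BATCH_SIZE = 2**17   # tokens per optimizer step
TIME_BUDGET = 1200         # seconds
num_steps = 2959

total_tokens = num_steps * TOTAL_BATCH_SIZE
print(total_tokens / 1e6)          # ~387.8, matching the reported total
print(total_tokens / TIME_BUDGET)  # ~323k tokens/sec sustained
```

If that multiplication ever stops matching the logged token count, something in the run (or the log parsing) is lying to you.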
Current best-known config
So far the best result is:
• TOTAL_BATCH_SIZE = 2**17
• TIME_BUDGET = 1200
• LR multiplier = 1.0
That combination beat:
• larger batch variants
• smaller 2**16 variant
• a lower-LR test
• shorter training budgets
Main lesson
For this 5090 path, the biggest lesson was that the winning configuration was not some glamorous “max everything” setup.
The better path was:
• a stable batch regime
• a longer training horizon
• and careful elimination of automation and backend mistakes
Why I’m posting this
If you are working on Blackwell / 5090 training and seeing bizarre behavior, it may not be your imagination. Some paths are simply much worse than they first appear.
The useful part of this exercise was not just finding a better benchmark number — it was finding a path that is:
• stable
• automatable
• reproducible
• and good enough to build real follow-on experiments on top of
If useful, I can also share the benchmark progression table and the automation loop structure we used to keep rerunning experiments automatically.