I spent time getting autoresearch running properly on an RTX 5090 / Blackwell setup and thought it might save other people some time to share what actually happened.
The short version
The initial path was badly broken. We saw extremely poor performance at first — on the order of a few thousand tok/sec and essentially useless MFU — despite the code technically “running.”
The eventual working path was:
• avoid the broken full-model compile path on this setup
• keep the good fused optimizer compile improvements where they actually helped
• use the stable SDPA / CuDNN attention path
• tune total batch and time budget empirically instead of guessing
• automate the benchmark / extract / strategize / rerun loop
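The last bullet is worth spelling out: the loop was benchmark → extract metrics → strategize the next config → rerun, with lock cleanup guaranteed even when a run died. A minimal, hypothetical skeleton of that control flow (every function name here is a placeholder I made up, not the actual tool's API):

```python
import json
from pathlib import Path

def run_loop(benchmark, extract, strategize, config, lock_path, max_iters=10):
    """benchmark/extract/strategize are injected callables standing in for
    the real tooling. Two properties matter: the lock file is always removed
    (even on failure), and the completion hook (extract) only runs after
    benchmark has actually finished -- getting that dispatch order wrong was
    one of the automation bugs described below."""
    history = []
    for _ in range(max_iters):
        lock = Path(lock_path)
        lock.write_text(json.dumps(config))   # mark a run as in flight
        try:
            raw = benchmark(config)           # launch and wait for the run
            metrics = extract(raw)            # completion hook: parse results
            history.append((config, metrics))
        finally:
            lock.unlink(missing_ok=True)      # cleanup even if the run crashed
        config = strategize(history)          # pick the next config to try
        if config is None:                    # strategizer says stop
            break
    return history
```

The try/finally around the lock is the shape of the fix for stale-lock bugs; in the real system, benchmark launched the training run and extract parsed val_bpb and MFU out of the logs.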
What failed
A few failure modes were especially misleading:
• a path that was technically correct but catastrophically slow
• MFU numbers that were misleading until the denominator (the card's actual peak FLOPS) was corrected for the 5090
• higher per-device batch settings that looked like they should help but actually made things much worse
• automation bugs around lock cleanup / completion hooks / dispatch order
In other words: there were several ways to get a run that looked alive while doing something stupid.
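The MFU trap deserves a formula. MFU is just achieved FLOPS over peak FLOPS, and the peak term is where things went wrong: plug in a datacenter-class or sparsity-inflated spec number for the 5090 and a healthy run looks broken (or a broken run looks fine). A minimal sketch with entirely hypothetical model numbers; the ~210 TFLOPS dense BF16 figure for the 5090 is my assumption, so verify it against the actual spec sheet:

```python
def mfu(tokens_per_sec: float, n_params: float, peak_flops: float) -> float:
    """Model FLOPs Utilization: achieved FLOPs/sec over peak FLOPs/sec.

    Uses the standard ~6 * N FLOPs-per-token estimate for a dense
    transformer forward + backward pass (N = parameter count).
    """
    achieved = tokens_per_sec * 6 * n_params
    return achieved / peak_flops

# Hypothetical numbers for illustration: a ~500M-param model at
# 30k tok/s, against an assumed ~210 TFLOPS dense BF16 peak.
print(f"{mfu(30_000, 500e6, 210e12):.2%}")  # ~42.86%
```

Same throughput, wrong peak in the denominator, and the reported utilization can be off by a large constant factor, which is exactly the kind of "looks alive while doing something stupid" signal described above.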
What helped
Real improvements came from:
• re-enabling the fused optimizer compile path
• reducing total batch from the original larger setting
• validating 2**17 as the better total batch region
• increasing time budget once the stable batch regime was found
• treating automation as part of the benchmark system, not an afterthought
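On the batch-size bullets: TOTAL_BATCH_SIZE here is tokens per optimizer step, so moving from 2**18 to 2**17 halves the gradient-accumulation count rather than the per-device batch. A sketch of the arithmetic, where the per-device batch, sequence length, and world size are hypothetical values, not our actual settings:

```python
def grad_accum_steps(total_batch_tokens, device_batch, seq_len, world_size):
    """Gradient-accumulation steps needed so that:
    total_batch_tokens = device_batch * seq_len * world_size * accum_steps
    """
    tokens_per_micro_step = device_batch * seq_len * world_size
    assert total_batch_tokens % tokens_per_micro_step == 0, "must divide evenly"
    return total_batch_tokens // tokens_per_micro_step

# Hypothetical single-5090 setup: per-device batch 32, sequence length 2048.
print(grad_accum_steps(2**17, 32, 2048, 1))  # 131072 / 65536 = 2
print(grad_accum_steps(2**18, 32, 2048, 1))  # 4
```

Keeping per-device batch fixed and varying only the accumulation count is what made the 2**17 vs 2**18 comparison a clean one-variable experiment.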
Progression
A simplified progression of the useful runs:
• baseline healthy run: val_bpb 1.165452, MFU 40.49%
• fused optimizer compile improvement: val_bpb 1.155400, MFU 42.88%
• TOTAL_BATCH_SIZE = 2**18: val_bpb 1.108381, MFU 43.18%
• TOTAL_BATCH_SIZE = 2**17 validation: val_bpb 1.089424, MFU 43.03%
• best current auto-loop result (TOTAL_BATCH_SIZE = 2**17, TIME_BUDGET = 1200, LR multiplier = 1.0): val_bpb 0.999445, MFU 42.56%, 387.8M total tokens, 2959 steps
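Those last numbers are internally consistent, which is a cheap sanity check worth building into any automation loop: steps times tokens-per-step should reproduce the total token count, and total tokens over the time budget gives real throughput.

```python
TOTAL_BATCH_SIZE = 2**17   # tokens per optimizer step
TIME_BUDGET = 1200         # seconds
num_steps = 2959

total_tokens = num_steps * TOTAL_BATCH_SIZE
print(total_tokens / 1e6)          # ~387.8, matching the reported total
print(total_tokens / TIME_BUDGET)  # ~323k tokens/sec sustained
```

If that multiplication ever stops matching the logged token count, something in the run (or the log parsing) is lying to you.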
Current best-known config
So far the best result is:
• TOTAL_BATCH_SIZE = 2**17
• TIME_BUDGET = 1200
• LR multiplier = 1.0
That combination beat:
• larger batch variants
• smaller 2**16 variant
• a lower-LR test
• shorter training budgets
Main lesson
For this 5090 path, the biggest lesson was that the winning configuration was not some glamorous “max everything” setup.
The better path was:
• a stable batch regime
• a longer training horizon
• and careful elimination of automation and backend mistakes
Why I’m posting this
If you are working on Blackwell / 5090 training and seeing bizarre behavior, it may not be your imagination. Some paths are simply much worse than they first appear.
The useful part of this exercise was not just finding a better benchmark number — it was finding a path that is:
• stable
• automatable
• reproducible
• and good enough to build real follow-on experiments on top of
If useful, I can also share the benchmark progression table and the automation loop structure we used to keep rerunning experiments automatically.