After ~3 weeks of experimentation in OpenAI's Parameter Golf competition, I wrote up why SSMs are structurally disadvantaged relative to transformers in a time- and size-constrained regime (10 min of training on 8xH100s, a 16MB artifact, 25M parameters): https://mradassaad.github.io/posts/why-ssms-struggle-in-parameter-golf/
Main findings:
- SSM in_proj weights compress up to 3.26x worse than attention QKV weights under LZMA, directly taxing the compressed parameter budget (rough measurement sketch below this list)
- Architectural wins validated at SP4096 flipped sign at SP8192: two configs that looked like clean wins reversed direction at the target vocabulary size
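
A minimal sketch of how that compressibility comparison could be measured, assuming per-tensor LZMA over raw float32 bytes; the tensor names and shapes below are placeholders, not the post's actual checkpoint layout:

```python
import lzma
import torch

def lzma_size(t: torch.Tensor) -> int:
    """Serialize a tensor's raw bytes and return its LZMA-compressed size."""
    raw = t.detach().cpu().contiguous().numpy().tobytes()
    return len(lzma.compress(raw, preset=9))

# Placeholder weights of illustrative shapes; swap in real state_dict entries.
in_proj = torch.randn(2 * 1024, 512)  # e.g. an SSM block's in_proj.weight
qkv = torch.randn(3 * 512, 512)       # e.g. an attention block's fused QKV weight

# Compressed bytes per parameter for each tensor; >1x means in_proj compresses worse.
ratio = (lzma_size(in_proj) / in_proj.numel()) / (lzma_size(qkv) / qkv.numel())
print(f"in_proj vs QKV compressed bytes per parameter: {ratio:.2f}x")
```

(With random placeholder weights the ratio will sit near 1x; the gap only shows up on trained checkpoints.)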
The post also covers three kernel-level experiments on the Mamba-3 Triton kernels: a backward-fusion attempt that was numerically exact but 16% slower due to shared-memory (SMEM) pressure, a torch.compile quantizer bug that cost 5.5 mBPB, and a mixed-precision protection of the dynamics that recovered 0.8 mBPB at negligible size cost.
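
For readers unfamiliar with the metric, here is a hedged sketch of how mBPB (milli-bits-per-byte) is typically computed from summed cross-entropy; the numbers below are illustrative, not results from the post:

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert summed token negative log-likelihood (in nats) to bits per byte of text."""
    return total_nll_nats / (math.log(2) * total_bytes)

# Made-up numbers: a 5.5 mBPB regression is a 0.0055 increase in bits per byte.
baseline = bits_per_byte(total_nll_nats=1.8000e8, total_bytes=int(1e8))
buggy    = bits_per_byte(total_nll_nats=1.8038e8, total_bytes=int(1e8))
print(f"delta = {(buggy - baseline) * 1000:.1f} mBPB")
```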



