Why SSMs struggle in parameter-constrained training: empirical findings at 25M parameters [R]

Reddit r/MachineLearning / 5/4/2026


Key Points

  • Based on ~3 weeks of experiments in OpenAI's Parameter Golf competition, the post explains why state space models (SSMs) are structurally disadvantaged versus transformers under strict time and size constraints: 10-minute training on 8x H100s, a 16MB artifact, and 25M parameters.
  • It finds that SSM in_proj weights compress up to 3.26x worse than attention QKV weights under LZMA, effectively consuming more of the compressed parameter budget.
  • It shows that architectural improvements can appear beneficial at a smaller vocabulary setting (SP4096) but reverse at the target setting (SP8192), indicating sensitivity to the evaluation vocabulary.
  • The work also includes kernel-level experiments on the Mamba-3 Triton kernels: a backward fusion attempt that was numerically exact but ~16% slower due to SMEM pressure, a torch.compile quantizer bug fix, and a mixed-precision dynamics protection, the latter two with measurable mBPB gains.

After ~3 weeks of experimentation in OpenAI's Parameter Golf competition, I wrote up why SSMs are structurally disadvantaged relative to transformers in a time- and size-constrained regime (10 min training, 16MB artifact, 25M parameters) on 8xH100s: https://mradassaad.github.io/posts/why-ssms-struggle-in-parameter-golf/

Main findings:

  1. SSM in_proj weights compress up to 3.26x worse than attention QKV under LZMA, directly taxing the compressed parameter budget (see the measurement sketch after this list)
  2. Architectural wins validated at SP4096 flipped sign at SP8192 — two configs that looked like clean wins reversed direction at the target vocabulary
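
To make finding 1 concrete, here is a minimal sketch of how such a compression gap can be measured, assuming nothing beyond standard PyTorch and the stdlib: LZMA-compress the raw bytes of each weight matrix and compare ratios. The shapes, d_model, and preset are illustrative, and freshly initialized weights will not reproduce the ~3.26x gap reported for trained weights.

```python
# Hedged sketch, not the competition harness: compare LZMA compression ratios
# of raw weight bytes for an attention-style QKV projection vs. an SSM-style in_proj.
import lzma

import torch

def lzma_ratio(t: torch.Tensor) -> float:
    """compressed_size / raw_size of the tensor's bytes (lower = compresses better)."""
    raw = t.detach().cpu().contiguous().numpy().tobytes()
    return len(lzma.compress(raw, preset=6)) / len(raw)

d_model = 512  # illustrative width, not the post's actual config
qkv = torch.nn.Linear(d_model, 3 * d_model, bias=False).weight.data.half()      # attention QKV
in_proj = torch.nn.Linear(d_model, 4 * d_model, bias=False).weight.data.half()  # SSM-style in_proj

print(f"QKV     LZMA ratio: {lzma_ratio(qkv):.3f}")
print(f"in_proj LZMA ratio: {lzma_ratio(in_proj):.3f}")
```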

Also includes three kernel-level experiments on the Mamba-3 Triton kernels: a backward fusion attempt that was numerically exact but 16% slower due to SMEM pressure, a torch.compile quantizer bug that cost 5.5 mBPB, and a mixed-precision dynamics protection that recovered 0.8 mBPB at negligible size cost.
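
The post does not spell out the mechanism behind the mixed-precision dynamics protection, but one plausible reading, sketched below under that assumption, is keeping the SSM state-dynamics parameters in fp32 while the rest of the model runs in bf16. The parameter names A_log and dt_bias follow common Mamba implementations and are assumptions here, not details from the write-up.

```python
# Hedged sketch, not the author's implementation: cast the model to bf16 but
# keep the state-dynamics parameters (assumed names: A_log, dt_bias) in fp32.
import torch

def protect_dynamics(model: torch.nn.Module,
                     keep_fp32=("A_log", "dt_bias")) -> torch.nn.Module:
    model.to(torch.bfloat16)  # bulk of the parameters in bf16
    for name, param in model.named_parameters():
        if any(name.endswith(suffix) for suffix in keep_fp32):
            param.data = param.data.to(torch.float32)  # dynamics parameters stay in fp32
    return model
```

A size cost close to zero is consistent with this reading, since the protected tensors are tiny compared to the projection matrices.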

submitted by /u/mradassaad