After ~3 weeks of experimentation in OpenAI's Parameter Golf competition, I wrote up why SSMs are structurally disadvantaged relative to transformers in a time- and size-constrained regime (10 min of training on 8xH100s, a 16MB artifact, 25M parameters): https://mradassaad.github.io/posts/why-ssms-struggle-in-parameter-golf/
Main findings:
- SSM in_proj weights compress up to 3.26x worse than attention QKV weights under LZMA, directly taxing the compressed parameter budget (rough measurement sketch below this list)
- Architectural wins validated at SP4096 flipped sign at SP8192: two configs that looked like clean wins reversed direction at the target vocabulary size
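
A minimal sketch of how that compressibility comparison could be measured, assuming per-tensor LZMA over raw float32 bytes; the tensor names and shapes below are placeholders, not the post's actual checkpoint layout:

```python
import lzma
import torch

def lzma_size(t: torch.Tensor) -> int:
    """Serialize a tensor's raw bytes and return its LZMA-compressed size."""
    raw = t.detach().cpu().contiguous().numpy().tobytes()
    return len(lzma.compress(raw, preset=9))

# Placeholder weights of illustrative shapes; swap in real state_dict entries.
in_proj = torch.randn(2 * 1024, 512)  # e.g. an SSM block's in_proj.weight
qkv = torch.randn(3 * 512, 512)       # e.g. an attention block's fused QKV weight

# Compressed bytes per parameter for each tensor; >1x means in_proj compresses worse.
ratio = (lzma_size(in_proj) / in_proj.numel()) / (lzma_size(qkv) / qkv.numel())
print(f"in_proj vs QKV compressed bytes per parameter: {ratio:.2f}x")
```

(With random placeholder weights the ratio will sit near 1x; the gap only shows up on trained checkpoints.)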
The post also covers three kernel-level experiments on the Mamba-3 Triton kernels: a backward-fusion attempt that was numerically exact but 16% slower due to shared-memory (SMEM) pressure, a torch.compile quantizer bug that cost 5.5 mBPB, and a mixed-precision protection of the dynamics that recovered 0.8 mBPB at negligible size cost.
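
For readers unfamiliar with the metric, here is a hedged sketch of how mBPB (milli-bits-per-byte) is typically computed from summed cross-entropy; the numbers below are illustrative, not results from the post:

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert summed token negative log-likelihood (in nats) to bits per byte of text."""
    return total_nll_nats / (math.log(2) * total_bytes)

# Made-up numbers: a 5.5 mBPB regression is a 0.0055 increase in bits per byte.
baseline = bits_per_byte(total_nll_nats=1.8000e8, total_bytes=int(1e8))
buggy    = bits_per_byte(total_nll_nats=1.8038e8, total_bytes=int(1e8))
print(f"delta = {(buggy - baseline) * 1000:.1f} mBPB")
```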



