We fit a 24M-parameter LLM into 15MB with per-row MSE quantization

Reddit r/LocalLLaMA / 3/25/2026


Key Points

  • The work described targets OpenAI’s “Parameter Golf” constraint of fitting a best-performing 24M-parameter LLM into a 16MB memory budget.
  • Instead of a single clipping threshold for INT8 quantization, the method searches multiple clip values per weight row and selects the one that minimizes reconstruction MSE, improving quantization fidelity at the cost of ~5× quantization time (~0.7s total).
  • The quantization approach determines a per-row scale from the chosen clip value, quantizes to int8, and evaluates reconstruction error (MSE) to pick the best candidate threshold.
  • Empirically, the author reports that in this parameter-squeezed regime, increasing width (e.g., 16M to 24M params) performs better than increasing depth, with the larger model costing only ~3.6% fewer training steps.

Working on OpenAI's Parameter Golf challenge (train best LLM possible, must fit in 16MB). Hit Top-3 on the leaderboard.

The quantization trick: instead of fixed-percentile INT8 clipping, we search 5 clip values per weight row and keep whichever gives lowest reconstruction MSE. Costs 5x quantization time (~0.7s total), gives measurable BPB improvement.

```python
import torch

# Candidate clip quantiles to search per weight tensor.
_GPTQ_CLIP_QS = [0.9999, 0.9995, 0.999, 0.998, 0.995]

def quantize_float_tensor(t):
    """Pick the clip quantile whose int8 reconstruction has the lowest MSE."""
    best_mse, best_q, best_s = float("inf"), None, None
    for clip_q in _GPTQ_CLIP_QS:
        clip = torch.quantile(t.abs(), clip_q)  # clip threshold for this candidate
        scale = clip / 127.0                    # symmetric int8 scale
        q = (t / scale).round().clamp(-128, 127).to(torch.int8)
        recon = q.float() * scale
        mse = float((t - recon).pow(2).mean())
        if mse < best_mse:
            best_mse, best_q, best_s = mse, q, scale
    return best_q, best_s
```
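The post says the search runs per weight row, while the snippet above takes a single tensor. A hypothetical wrapper (my sketch, not from the linked PR; the function and variable names are assumptions) that applies the same MSE search independently to each row of a 2-D weight matrix might look like:

```python
import torch

# Same candidate clip quantiles as in the post.
_CLIP_QS = [0.9999, 0.9995, 0.999, 0.998, 0.995]

def quantize_rows(w):
    """Quantize each row of a 2-D weight matrix to int8 independently.

    Hypothetical sketch: for every row, try each candidate clip quantile
    and keep the (int8 row, scale) pair with the lowest reconstruction MSE.
    Returns the int8 matrix and a per-row scale vector.
    """
    q_rows, scales = [], []
    for row in w:
        best_mse, best_q, best_s = float("inf"), None, None
        for clip_q in _CLIP_QS:
            clip = torch.quantile(row.abs(), clip_q)
            scale = (clip / 127.0).clamp(min=1e-12)  # guard against all-zero rows
            q = (row / scale).round().clamp(-128, 127).to(torch.int8)
            mse = float((row - q.float() * scale).pow(2).mean())
            if mse < best_mse:
                best_mse, best_q, best_s = mse, q, scale
        q_rows.append(best_q)
        scales.append(best_s)
    return torch.stack(q_rows), torch.stack(scales)

w = torch.randn(4, 256)
q, s = quantize_rows(w)
recon = q.float() * s.unsqueeze(1)  # dequantize: broadcast each row's scale
```

Storing one float scale per row (4 bytes per row on top of 1 byte per weight) is what makes the per-row search nearly free in the memory budget while letting outlier-heavy rows keep a wider clip than well-behaved ones.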

Also found that width scales better than depth in this regime - going from 16M to 24M params only costs ~3.6% fewer training steps.

Full code: https://github.com/openai/parameter-golf/pull/604

submitted by /u/TrashFun5286