We fit a 24M-parameter LLM into 15MB with per-row MSE quantization

Reddit r/LocalLLaMA / 3/25/2026


Key Points

  • The work described targets OpenAI’s “Parameter Golf” constraint of fitting a best-performing 24M-parameter LLM into a 16MB memory budget.
  • Instead of a single clipping threshold for INT8 quantization, the method searches multiple clip values per weight row and selects the one that minimizes reconstruction MSE, improving quantization fidelity at the cost of ~5× quantization time (~0.7s total).
  • The quantization approach determines a per-row scale from the chosen clip value, quantizes to int8, and evaluates reconstruction error (MSE) to pick the best candidate threshold.
  • Empirically, the author reports that in this parameter-squeezed regime, increasing width (e.g., 16M to 24M params) performs better than increasing depth, with the larger model costing only ~3.6% fewer training steps.

Working on OpenAI's Parameter Golf challenge (train best LLM possible, must fit in 16MB). Hit Top-3 on the leaderboard.

The quantization trick: instead of fixed-percentile INT8 clipping, we search 5 clip values per weight row and keep whichever gives lowest reconstruction MSE. Costs 5x quantization time (~0.7s total), gives measurable BPB improvement.

```python
import torch

# Candidate clip quantiles to search per weight tensor.
_GPTQ_CLIP_QS = [0.9999, 0.9995, 0.999, 0.998, 0.995]

def quantize_float_tensor(t):
    """Pick the clip quantile whose int8 reconstruction has the lowest MSE."""
    best_mse, best_q, best_s = float("inf"), None, None
    for clip_q in _GPTQ_CLIP_QS:
        clip = torch.quantile(t.abs(), clip_q)  # clip threshold for this candidate
        scale = clip / 127.0                    # symmetric int8 scale
        q = (t / scale).round().clamp(-128, 127).to(torch.int8)
        recon = q.float() * scale
        mse = float((t - recon).pow(2).mean())
        if mse < best_mse:
            best_mse, best_q, best_s = mse, q, scale
    return best_q, best_s
```
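The post says the search runs per weight row, while the snippet above takes a single tensor. A hypothetical wrapper (my sketch, not from the linked PR; the function and variable names are assumptions) that applies the same MSE search independently to each row of a 2-D weight matrix might look like:

```python
import torch

# Same candidate clip quantiles as in the post.
_CLIP_QS = [0.9999, 0.9995, 0.999, 0.998, 0.995]

def quantize_rows(w):
    """Quantize each row of a 2-D weight matrix to int8 independently.

    Hypothetical sketch: for every row, try each candidate clip quantile
    and keep the (int8 row, scale) pair with the lowest reconstruction MSE.
    Returns the int8 matrix and a per-row scale vector.
    """
    q_rows, scales = [], []
    for row in w:
        best_mse, best_q, best_s = float("inf"), None, None
        for clip_q in _CLIP_QS:
            clip = torch.quantile(row.abs(), clip_q)
            scale = (clip / 127.0).clamp(min=1e-12)  # guard against all-zero rows
            q = (row / scale).round().clamp(-128, 127).to(torch.int8)
            mse = float((row - q.float() * scale).pow(2).mean())
            if mse < best_mse:
                best_mse, best_q, best_s = mse, q, scale
        q_rows.append(best_q)
        scales.append(best_s)
    return torch.stack(q_rows), torch.stack(scales)

w = torch.randn(4, 256)
q, s = quantize_rows(w)
recon = q.float() * s.unsqueeze(1)  # dequantize: broadcast each row's scale
```

Storing one float scale per row (4 bytes per row on top of 1 byte per weight) is what makes the per-row search nearly free in the memory budget while letting outlier-heavy rows keep a wider clip than well-behaved ones.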

Also found that width scales better than depth in this regime - going from 16M to 24M params only costs ~3.6% fewer training steps.

Full code: https://github.com/openai/parameter-golf/pull/604

submitted by /u/TrashFun5286