nvidia/gpt-oss-puzzle-88B · Hugging Face

Reddit r/LocalLLaMA / 3/26/2026

💬 Opinion · Signals & Early Trends · Models & Research

Key Points

  • gpt-oss-puzzle-88B is a deployment-oriented large language model optimized by NVIDIA with Puzzle (a post-training NAS framework), derived from OpenAI's gpt-oss-120b.
  • It targets higher inference efficiency for both long- and short-context serving on NVIDIA H100-class hardware, especially for inference workloads where KV-cache bandwidth and memory capacity tend to be the bottleneck.
  • It reduces total parameters to ~88B (≈73% of the parent) while reporting throughput gains of 1.63× in long-context (64K/64K) and 1.22× in short-context (4K/4K) scenarios, and up to 2.82× on a single H100.
  • The model is a decoder-only Transformer: a modified Mixture-of-Experts (MoE) gpt-oss architecture that varies the expert count and the global/window attention pattern per layer.
  • Reasoning accuracy is reported to match or slightly exceed the parent across reasoning efforts.

gpt-oss-puzzle-88B is a deployment-optimized large language model developed by NVIDIA, derived from OpenAI's gpt-oss-120b.
The model is produced using Puzzle, a post-training neural architecture search (NAS) framework, with the goal of significantly improving inference efficiency for reasoning-heavy workloads while maintaining or improving accuracy across reasoning budgets.

The model is specifically optimized for long-context and short-context serving on NVIDIA H100-class hardware, where reasoning models are often bottlenecked by KV-cache bandwidth and memory capacity rather than raw compute.
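To see why KV-cache capacity, rather than raw compute, dominates at long context, a back-of-the-envelope estimate helps. The sketch below uses illustrative dimensions only (36 layers, 8 KV heads, head dim 64, fp16), not the published gpt-oss configuration:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, dtype_bytes=2):
    """Bytes to cache K and V for one sequence: 2 tensors per layer,
    each of shape (seq_len, num_kv_heads, head_dim)."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

# Hypothetical dimensions -- NOT the real gpt-oss config.
total = kv_cache_bytes(num_layers=36, num_kv_heads=8, head_dim=64, seq_len=64 * 1024)
print(f"{total / 2**30:.1f} GiB per 64K-token sequence")  # 4.5 GiB
```

Multiplied across concurrent sequences in a serving batch, this per-sequence cost quickly exhausts HBM and its bandwidth, which is the pressure the Puzzle-optimized architecture is reported to relieve.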

Compared to its parent, gpt-oss-puzzle-88B:

  • Reduces total parameters to ~88B (≈73% of the parent),
  • Achieves 1.63× throughput improvement in long-context (64K/64K) scenarios on an 8×H100 node,
  • Achieves 1.22× throughput improvement in short-context (4K/4K) scenarios,
  • Delivers up to 2.82× throughput improvement on a single H100 GPU,
  • Matches or slightly exceeds parent accuracy across reasoning efforts.
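The figures above can be sanity-checked directly; the multipliers below come from the model card, while the baseline throughput is a hypothetical placeholder purely to show how the speedups compose:

```python
# Parameter reduction reported in the model card.
parent_params_b, puzzle_params_b = 120, 88
ratio = puzzle_params_b / parent_params_b
print(f"parameter ratio: {ratio:.0%}")  # ~73% of the parent

# Reported speedups; 1000 tok/s is a made-up parent baseline for illustration.
speedups = {
    "long-context 64K/64K, 8xH100": 1.63,
    "short-context 4K/4K, 8xH100": 1.22,
    "single H100": 2.82,
}
baseline_tok_s = 1000.0
derived = {k: baseline_tok_s * x for k, x in speedups.items()}
```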

Model Architecture

  • Architecture Type: Mixture-of-Experts Decoder-only Transformer
  • Network Architecture: Modified gpt-oss architecture with a varying number of experts per layer and a modified global/window attention pattern across layers.
  • Number of model parameters: 88B
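A heterogeneous stack like the one described, where each layer can get its own expert count and attention type, can be sketched as a per-layer config. All values below are hypothetical illustrations of the idea, not the published layer assignments:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LayerConfig:
    """Per-layer settings in a heterogeneous MoE stack (illustrative only)."""
    num_experts: int              # MoE experts available in this layer
    attention: str                # "global" (full context) or "window" (sliding)
    window: Optional[int] = None  # sliding-window span when attention == "window"

# Hypothetical 6-layer slice: a Puzzle-style NAS may assign each layer its own
# expert count and attention pattern instead of one uniform setting.
stack = [
    LayerConfig(128, "global"),
    LayerConfig(64, "window", 4096),
    LayerConfig(32, "window", 4096),
    LayerConfig(64, "window", 4096),
    LayerConfig(128, "global"),
    LayerConfig(32, "window", 4096),
]

# Windowed layers cap their KV cache at the window size, which is one way
# such a pattern reduces the long-context memory bottleneck.
assert all((l.attention == "window") == (l.window is not None) for l in stack)
```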
submitted by /u/jacek2023