In harmony with gpt-oss

arXiv cs.AI / 4/2/2026


Key Points

  • OpenAI’s published gpt-oss-20b tool-using results have not been independently reproduced, because the original paper reportedly omits both the tool definitions and the agent-harness details.
  • The authors reverse-engineer the model’s in-distribution tool-calling behavior, finding that it invokes tools with high confidence even when tool definitions are not provided, suggesting a strong learned prior.
  • They build a “harmony” native agent harness that encodes messages in the model’s native format, avoiding fidelity loss from Chat Completions conversion.
  • Using this approach, they report the first independent reproduction of OpenAI’s published scores, including 60.4% (vs 60.7%) on SWE Verified HIGH, 53.3% (vs 53.2%) on SWE Verified MEDIUM, and 91.7% (vs 90.4%) on AIME25 with tools.
  • The work is released with a GitHub implementation (harmonyagent), aiming to make independent reproduction of tool-using evaluations practical for others.

Abstract

No one has independently reproduced OpenAI's published scores for gpt-oss-20b with tools, because the original paper discloses neither the tools nor the agent harness. We reverse-engineered the model's in-distribution tools: when prompted without tool definitions, gpt-oss still calls tools from its training distribution with high statistical confidence -- a strong prior, not a hallucination. We then built a native harmony agent harness (https://github.com/borislavmavrin/harmonyagent.git) that encodes messages in the model's native format, bypassing the lossy Chat Completions conversion. Together, these yield the first independent reproduction of OpenAI's published scores: 60.4% on SWE Verified HIGH (published 60.7%), 53.3% MEDIUM (53.2%), and 91.7% on AIME25 with tools (90.4%).
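The reverse-engineering step described in the abstract, prompting the model without tool definitions and observing which tools it still tries to call, amounts to scanning raw harmony output for tool-call headers. A hedged sketch of such a detector is below; the `to=` recipient syntax follows the harmony format's tool-call convention, but the exact regex and the claim that every such header is a tool call are simplifying assumptions for illustration:

```python
import re

# Assumption: in harmony-style output, a tool call appears as an assistant
# header with a channel and a "to=<recipient>" recipient, e.g.
#   <|start|>assistant<|channel|>commentary to=functions.python ...
# This sketch just extracts those recipients from a raw transcript.
TOOL_CALL_RE = re.compile(r"<\|start\|>assistant<\|channel\|>\w+ to=([\w.]+)")

def extract_tool_calls(raw_output: str) -> list[str]:
    """Return the tool recipients the model attempted to call."""
    return TOOL_CALL_RE.findall(raw_output)

sample = ('<|start|>assistant<|channel|>commentary to=functions.python '
          '<|constrain|>json<|message|>{"code": "print(1)"}<|call|>')
```

Counting such extractions over many samples, with no tools declared in the prompt, is one way to measure how strongly tool calling is baked into the model's training distribution.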