In harmony with gpt-oss
arXiv cs.AI / 4/2/2026
💬 OpinionIdeas & Deep AnalysisTools & Practical UsageModels & Research
Key Points
- OpenAI’s gpt-oss-20b results have not been independently reproduced because the original paper reportedly omitted the tools and the agent harness details.
- The authors reverse-engineer the model’s in-distribution tool-calling behavior, finding that it invokes tools with high confidence even when tool definitions are not provided, suggesting a strong learned prior.
- They build a “harmony” native agent harness that encodes messages in the model’s native format, avoiding fidelity loss from Chat Completions conversion.
- Using this approach, they report the first independent reproduction of OpenAI’s published scores, including 60.4% (vs 60.7%) on SWE Verified HIGH, 53.3% (vs 53.2%) on SWE Verified MEDIUM, and 91.7% (vs 90.4%) on AIME25 with tools.
- The work is released with a GitHub implementation (harmonyagent), aiming to make reproducibility of tool-using evaluations more practical for others.
Related Articles

Black Hat USA
AI Business

Black Hat Asia
AI Business

From Chaos to Calendar: AI for Your Market Garden Plan
Dev.to

Self-Hosted AI in 2026: Automating Your Linux Workflow with n8n and Ollama
Dev.to

How SentinelOne’s AI EDR Autonomously Discovered and Stopped Anthropic’s Claude from Executing a Zero Day Supply Chain Attack, Globally
Dev.to