ProgramBench: Can we really rebuild huge binaries from scratch? (Doesn't look like it)

Reddit r/LocalLLaMA / 2026/5/6

💬 Opinion / Developer Stack & Infrastructure / Tools & Practical Usage / Models & Research

Key points

  • ProgramBench proposes a benchmark that evaluates whether an agent can rebuild a large program from scratch given only the target executable and README/usage files (no internet access and no decompilation, as anti-cheating measures).
  • The benchmark consists of 200 tasks, with an emphasis on rigorous testing, cheat prevention, and task diversity to keep the evaluation fair and varied.
  • To avoid assumptions about the implementation language, roughly 6 million lines of behavioral tests were generated and filtered down to the most effective ones; the executable is treated as a black box.
  • The benchmark code, Hugging Face artifacts, and Docker images have been open-sourced, and evaluation can be run with just a pip install and a single command.
  • Evaluation currently covers only closed models, but submissions will open soon; the authors note that open-source models tend to be harder to get behaving well on these tasks.
ProgramBench: Can we really rebuild huge binaries from scratch? (doesn't look like it)

There have been quite a few case studies recently of agents building whole programs from scratch, but most of them test only a single project or a handful of them, with hand-tuned setups.

We've spent the last couple of months formalizing this setting and building a benchmark of 200 tasks while doubling down on testing, cheat prevention, and task diversity.

Our agent ONLY gets a target executable and some readme/usage files. The agent must choose a language, design abstraction layers, and architect the entire program. No internet access or any other way of cheating. No decompilation.

We've also spent some 50k to generate 6M lines of behavioral tests and then filtered them down to keep the best ones. Because the tests only exercise the executable as a black box, we make no assumptions even about the language the LM uses to implement the program.
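
To make the "black box" part concrete, here is a minimal sketch of what one such behavioral check could look like, assuming a program that reads stdin and writes stdout; the binary names and input are invented, and this is not the actual ProgramBench harness:

```sh
# Hypothetical black-box behavioral check (NOT the actual ProgramBench harness):
# feed the same input to the reference target and to the rebuilt program,
# then compare their observable output. Binary names and input are made up.
printf 'example input\n' | ./target_binary  > expected.out
printf 'example input\n' | ./rebuilt_binary > actual.out

if diff -q expected.out actual.out > /dev/null; then
    echo "PASS: rebuilt program matches the target's behavior on this input"
else
    echo "FAIL: outputs differ"
fi
```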

All of the results are at programbench.com. There's also a big FAQ at the bottom.

We've just open-sourced our GitHub repo, Hugging Face artifacts, and Docker images.

Essentially you can just start evaluating with pip install programbench && programbench eval <your submission>
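
In a shell, that workflow looks roughly like this; <your submission> is the post's placeholder, and the exact submission format is an assumption to check against the repo's README:

```sh
# Install the evaluation harness and score a submission, as described above.
# <your submission> is a placeholder; see the GitHub README for the expected format.
pip install programbench
programbench eval <your submission>
```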

GitHub is at https://github.com/facebookresearch/programbench

Sorry that it's just closed-source models right now. We have a few open-source models in the pipeline, but so far we've had an even harder time getting them to behave well on these tasks (open-source models tend to be somewhat overfitted to things like SWE-bench, so they often struggle more with new benchmarks).

We're also planning to open the benchmark for submissions quite soon, similar to what we did on SWE-bench and its variants.

submitted by /u/klieret