Execution-Verified Reinforcement Learning for Optimization Modeling

arXiv cs.AI / 4/2/2026


Key Points

  • The paper introduces Execution-Verified Optimization Modeling (EVOM), a closed-loop framework that uses a mathematical programming solver as a deterministic verifier for LLM-generated solver-specific code.
  • EVOM converts sandboxed execution outcomes into scalar rewards and trains via GRPO (Group Relative Policy Optimization) and DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization), avoiding costly process-level supervision that can overfit to a single solver API.
  • By switching the verification environment (solver backend) rather than rebuilding solver-specific datasets, EVOM targets cross-solver generalization and zero-shot solver transfer.
  • Experiments across multiple optimization benchmarks (NL4OPT, MAMO, IndustryOR, OptiBench) and solver backends (Gurobi, OR-Tools, COPT) show EVOM matches or outperforms process-supervised SFT and supports low-cost adaptation by continuing training under a target solver.
  • The work positions execution-verified reinforcement learning as an alternative path to “scalable decision intelligence” using LLMs for automated optimization modeling.
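The core mechanism in the bullets above, turning a sandboxed execution outcome into a scalar reward, can be sketched as below. The paper does not specify its exact reward scheme; the tiered values, the convention that generated code prints its objective on the last line, and all function names here are illustrative assumptions.

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical reward tiers; the paper only states that execution
# outcomes become scalar rewards, not the exact shaping.
REWARD_CORRECT = 1.0      # code runs and objective matches the reference
REWARD_EXECUTABLE = 0.1   # code runs but objective is wrong
REWARD_FAILED = 0.0       # crash, timeout, or unparseable output

def execution_reward(code: str, reference_obj: float,
                     timeout_s: float = 10.0, tol: float = 1e-6) -> float:
    """Run LLM-generated solver code in a subprocess sandbox and score it."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        out = subprocess.run([sys.executable, path], capture_output=True,
                             text=True, timeout=timeout_s)
        if out.returncode != 0:
            return REWARD_FAILED
        # Assumed convention: generated code prints its objective value last.
        obj = float(out.stdout.strip().splitlines()[-1])
        return REWARD_CORRECT if abs(obj - reference_obj) <= tol else REWARD_EXECUTABLE
    except (subprocess.TimeoutExpired, ValueError, IndexError):
        return REWARD_FAILED
    finally:
        os.unlink(path)
```

Because the verifier only inspects the execution outcome, swapping the solver backend (e.g. a Gurobi harness for an OR-Tools one) changes the environment without requiring new process-level labels, which is the cross-solver point the bullets make.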

Abstract

Automating optimization modeling with LLMs is a promising path toward scalable decision intelligence, but existing approaches either rely on agentic pipelines built on closed-source LLMs with high inference latency, or fine-tune smaller LLMs using costly process supervision that often overfits to a single solver API. Inspired by reinforcement learning with verifiable rewards, we propose Execution-Verified Optimization Modeling (EVOM), an execution-verified learning framework that treats a mathematical programming solver as a deterministic, interactive verifier. Given a natural-language problem and a target solver, EVOM generates solver-specific code, executes it in a sandboxed harness, and converts execution outcomes into scalar rewards, optimized with GRPO and DAPO in a closed-loop generate-execute-feedback-update process. This outcome-only formulation removes the need for process-level supervision and enables cross-solver generalization by switching the verification environment rather than reconstructing solver-specific datasets. Experiments on NL4OPT, MAMO, IndustryOR, and OptiBench across Gurobi, OR-Tools, and COPT show that EVOM matches or outperforms process-supervised SFT, supports zero-shot solver transfer, and achieves effective low-cost solver adaptation by continuing training under the target solver backend.
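The GRPO step in the closed loop above compares each sampled program's scalar reward against the other samples for the same problem, using the group statistics as a baseline instead of a learned value function. A minimal sketch of that group-relative advantage computation (a generic illustration of GRPO, not the paper's implementation):

```python
import statistics

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: standardize each sampled completion's
    reward against its group's mean and standard deviation, so that
    above-average completions in the group get positive advantage."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard: all-equal rewards
    return [(r - mu) / sigma for r in rewards]
```

For example, a group of four sampled programs with rewards `[1.0, 0.0, 0.0, 1.0]` yields advantages `[1.0, -1.0, -1.0, 1.0]`, and a group where every sample earns the same reward yields all-zero advantages, i.e. no gradient signal.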