HARBOR: Automated Harness Optimization

arXiv cs.LG · April 24, 2026


Key Points

  • The paper argues that long-horizon language-model agents’ performance and complexity are driven more by the “harness” (wrappers like context compaction, tool caching, semantic memory, and execution sandbox glue) than by the underlying model itself.
  • It presents automated harness optimization as a constrained, noisy Bayesian optimization problem over a mixed, heterogeneous configuration space, using cold-start-corrected rewards and a posterior chance-constrained safety check.
  • The authors introduce HARBOR, a reference solver that combines a block-additive SAAS surrogate, multi-fidelity cost-aware acquisition, and TuRBO-style trust regions.
  • Experiments instantiate the problem in a flag-gated harness over a production coding agent, comparing an end-to-end HARBOR run against a controlled four-round manual-tuning study on a fixed task suite.
  • The method is designed to be task-class agnostic, applying to other agent harnesses as long as the flag space is bounded and a reproducible task suite is available.
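To make the problem statement concrete, here is a deliberately tiny sketch of the search loop the key points describe: noisy evaluation over a bounded flag space, a cold-start reward correction, and a chance-constrained safety filter. Everything here is a synthetic stand-in — the flag names, reward model, and random search are illustrative assumptions, not HARBOR's actual surrogate or acquisition.

```python
import random
import statistics

# Hypothetical binary flag space (names invented for illustration).
FLAGS = ["compaction", "tool_cache", "semantic_memory", "speculation"]

def noisy_eval(config, rng):
    """One noisy task-suite run: returns (reward, passed_safety).

    The reward model is synthetic: each enabled flag adds signal,
    tool caching pays a one-off warm-up cost, and speculation
    occasionally trips the safety check in this toy model.
    """
    base = sum(0.2 * v for v in config.values())
    cold_start = 0.3 if config["tool_cache"] else 0.0
    reward = base - cold_start + rng.gauss(0, 0.05)
    passed = rng.random() > (0.3 if config["speculation"] else 0.02)
    return reward, passed

def cold_start_corrected(config, rng, n=8):
    """Average reward over n runs, discarding the first (cold-start) run."""
    runs = [noisy_eval(config, rng) for _ in range(n)]
    rewards = [r for r, _ in runs[1:]]  # drop the cold-start sample
    pass_rate = sum(p for _, p in runs) / n
    return statistics.mean(rewards), pass_rate

def search(budget=40, safety_floor=0.9, seed=0):
    """Random search with a chance-constrained accept rule.

    A configuration is only eligible if its empirical safety pass
    rate clears the floor; among eligible configs, keep the best
    corrected reward. (HARBOR replaces this loop with a Bayesian
    surrogate and posterior chance constraint.)
    """
    rng = random.Random(seed)
    best, best_reward = None, float("-inf")
    for _ in range(budget):
        config = {f: rng.random() < 0.5 for f in FLAGS}
        reward, pass_rate = cold_start_corrected(config, rng)
        if pass_rate < safety_floor:  # chance constraint violated
            continue
        if reward > best_reward:
            best, best_reward = config, reward
    return best, best_reward
```

The point of the sketch is the shape of the problem: evaluations are noisy and costly, the first run of a config is systematically biased by cold starts, and safety is a probabilistic constraint rather than a hard filter on observed failures.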

Abstract

Long-horizon language-model agents are dominated, in lines of code and in operational complexity, not by their underlying model but by the harness that wraps it: context compaction, tool caching, semantic memory, trajectory reuse, speculative tool prediction, and the glue that binds the model to a sandboxed execution environment. We argue that harness design is a first-class machine-learning problem and that automated configuration search dominates manual stacking once the flag space exceeds a handful of bits. We defend this claim in two steps. First, we formalize automated harness optimization as constrained noisy Bayesian optimization over a mixed-variable, cost-heterogeneous configuration space with cold-start-corrected rewards and a posterior chance-constrained safety check, and give a reference solver, HARBOR (Harness Axis-aligned Regularized Bayesian Optimization Routine), built from a block-additive SAAS surrogate, multi-fidelity cost-aware acquisition, and TuRBO trust regions. Second, we instantiate the problem in a flag-gated harness over a production coding agent and report a controlled four-round manual-tuning case study against a fixed task suite and an end-to-end HARBOR run. The formulation itself is task-class agnostic: the configuration space, reward correction, acquisition, and safety check apply to any agent harness with a bounded flag space and a reproducible task suite.
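The abstract's "multi-fidelity cost-aware acquisition" can be illustrated with a standard construction: score each candidate by expected improvement divided by its predicted evaluation cost, so cheap low-fidelity runs are preferred when their information value is comparable. This is a generic sketch of that idea under Gaussian-posterior assumptions, not HARBOR's actual acquisition function.

```python
import math

def expected_improvement(mu, sigma, best):
    """Closed-form EI for a Gaussian posterior N(mu, sigma^2), maximizing."""
    if sigma <= 0:
        return max(mu - best, 0.0)
    z = (mu - best) / sigma
    phi = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)   # standard normal pdf
    Phi = 0.5 * (1 + math.erf(z / math.sqrt(2)))             # standard normal cdf
    return (mu - best) * Phi + sigma * phi

def cost_aware_score(mu, sigma, best, cost):
    """EI per unit evaluation cost: among equally promising candidates,
    the cheaper fidelity wins."""
    return expected_improvement(mu, sigma, best) / cost
```

With this scoring, a full-suite evaluation only beats a cheap subset evaluation when its extra information (higher EI from a tighter posterior) outweighs its extra cost.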