Operating-Layer Controls for Onchain Language-Model Agents Under Real Capital

arXiv cs.AI / April 30, 2026


Key Points

  • The paper studies the reliability of autonomous onchain language-model agents that translate user instructions into validated trading tool actions under real capital, observed in a 21-day DX Terminal Pro deployment.
  • Across 3,505 user-funded agents, the system generated 7.5M agent invocations, about 300K onchain actions, roughly $20M volume, and 99.9% settlement success for policy-validated transactions.
  • The authors find that reliability is not achieved by the base model alone, but by an “operating layer” that adds prompt compilation, typed controls, policy validation, execution guards, memory design, and trace-level observability.
  • Pre-launch testing surfaced failure modes that text-only benchmarks miss—such as fabricated trading rules, fee-related paralysis, numeric anchoring, cadence trading, and tokenomics misreads—and targeted harness changes significantly reduced these issues and increased capital deployment.
  • The paper argues that capital-managing agents should be evaluated end-to-end, from user mandate through prompt/rationale to validated action and final settlement.
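To make the "operating layer" idea concrete, here is a minimal sketch of one of its components: typed controls plus policy validation, which check an agent-proposed trade against a user-configured vault policy before it can be submitted. All class and field names (`VaultPolicy`, `ProposedTrade`, `max_trade_eth`, etc.) are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass
from enum import Enum

class Action(Enum):
    BUY = "buy"
    SELL = "sell"

@dataclass(frozen=True)
class VaultPolicy:
    """Typed controls a user might configure; field names are illustrative."""
    max_trade_eth: float
    min_trade_eth: float
    allowed_actions: frozenset  # set of Action members

@dataclass(frozen=True)
class ProposedTrade:
    """An action the agent proposes, with its stated rationale."""
    action: Action
    size_eth: float
    rationale: str

def validate(policy: VaultPolicy, trade: ProposedTrade) -> list:
    """Return a list of policy violations; an empty list means the trade may be submitted."""
    errors = []
    if trade.action not in policy.allowed_actions:
        errors.append(f"action {trade.action.value} not permitted by vault policy")
    if trade.size_eth > policy.max_trade_eth:
        errors.append(f"size {trade.size_eth} ETH exceeds cap {policy.max_trade_eth} ETH")
    if trade.size_eth < policy.min_trade_eth:
        errors.append(f"size {trade.size_eth} ETH below minimum {policy.min_trade_eth} ETH")
    return errors

policy = VaultPolicy(max_trade_eth=2.0, min_trade_eth=0.01,
                     allowed_actions=frozenset({Action.BUY, Action.SELL}))
ok = validate(policy, ProposedTrade(Action.BUY, 0.5, "momentum entry"))
bad = validate(policy, ProposedTrade(Action.SELL, 5.0, "oversized exit"))
print(ok)   # []
print(bad)  # ['size 5.0 ETH exceeds cap 2.0 ETH']
```

The key design point this illustrates: the language model never submits transactions directly. Every proposal passes through deterministic, typed checks, which is plausibly how the deployment achieved 99.9% settlement success for policy-validated transactions.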

Abstract

We study reliability in autonomous language-model agents that translate user mandates into validated tool actions under real capital. The setting is DX Terminal Pro, a 21-day deployment in which 3,505 user-funded agents traded real ETH in a bounded onchain market. Users configured vaults through structured controls and natural-language strategies, but only agents could choose normal buy/sell trades. The system produced 7.5M agent invocations, roughly 300K onchain actions, about $20M in volume, more than 5,000 ETH deployed, roughly 70B inference tokens, and 99.9% settlement success for policy-valid submitted transactions. Long-running agents accumulated thousands of sequential decisions, including 6,000+ prompt-state-action cycles for continuously active agents, yielding a large-scale trace from user mandate to rendered prompt, reasoning, validation, portfolio state, and settlement. Reliability did not come from the base model alone; it emerged from the operating layer around the model: prompt compilation, typed controls, policy validation, execution guards, memory design, and trace-level observability. Pre-launch testing exposed failures that text-only benchmarks rarely measure, including fabricated trading rules, fee paralysis, numeric anchoring, cadence trading, and misread tokenomics. Targeted harness changes reduced fabricated sell rules from 57% to 3%, reduced fee-led observations from 32.5% to below 10%, and increased capital deployment from 42.9% to 78.0% in an affected test population. We show that capital-managing agents should be evaluated across the full path from user mandate to prompt, validated action, and settlement.
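The abstract's end-to-end trace, from user mandate through rendered prompt, reasoning, validation, and settlement, can be sketched as a simple data structure. This is an assumed shape for illustration only; the field names (`rendered_prompt`, `policy_valid`, `settled`) are not taken from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class CycleTrace:
    """One prompt-state-action cycle; field names are illustrative."""
    rendered_prompt: str
    reasoning: str
    proposed_action: str
    policy_valid: bool   # did the proposal pass policy validation?
    settled: bool        # did the submitted transaction settle onchain?

@dataclass
class AgentTrace:
    """Full trace for one agent, anchored to the user's natural-language mandate."""
    user_mandate: str
    cycles: list = field(default_factory=list)

    def record(self, cycle: CycleTrace) -> None:
        self.cycles.append(cycle)

    def settlement_rate(self) -> float:
        """Settlement success over policy-valid submissions only,
        mirroring how the paper reports its 99.9% figure."""
        submitted = [c for c in self.cycles if c.policy_valid]
        return sum(c.settled for c in submitted) / len(submitted) if submitted else 0.0

trace = AgentTrace(user_mandate="accumulate ETH on dips, max 2 ETH per trade")
trace.record(CycleTrace("<prompt>", "price below anchor", "buy 0.5 ETH", True, True))
trace.record(CycleTrace("<prompt>", "fabricated sell rule", "sell 9.9 ETH", False, False))
print(trace.settlement_rate())  # 1.0 — invalid proposals never reach settlement
```

Keeping every cycle linked back to the originating mandate is what makes the paper's proposed evaluation possible: failure modes like fabricated trading rules show up as rejected cycles in the trace rather than as silent losses.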