Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents
arXiv cs.AI · April 22, 2026
Key Points
- The paper argues that long-horizon enterprise AI agents need evaluation beyond a single task-success score because it conflates failure modes and does not reveal whether agents meet deployment-specific regulatory standards.
- It proposes a four-axis, independently measurable framework for decision alignment: factual precision (FRP), reasoning coherence (RCS), compliance reconstruction (CRR), and calibrated abstention (CAR), with CRR explicitly grounded in regulatory requirements.
- Experiments on LongHorizon-Bench for loan qualification and insurance claims adjudication show that aggregate accuracy can miss key issues—e.g., retrieval failures mainly harm factual precision and schema-anchored methods incur a scaffolding tax.
- The study finds that a straightforward fact-preservation summarization prompt is a strong baseline across several axes, while every evaluated architecture commits to a decision on every case rather than abstaining, exposing an unaddressed decision-alignment problem.
- The authors claim the framework generalizes to regulated domains by building a fact schema and calibrating a CRR auditor prompt to assess regulatory alignment.
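The four axes above are described as independently measurable, which implies per-axis scoring rather than one aggregate number. The sketch below is a hypothetical illustration of that idea, not the paper's actual metric definitions: it assumes each case receives binary per-axis judgments that are averaged into axis scores, and shows how an agent that never abstains can score perfectly on three axes while failing calibrated abstention.

```python
# Illustrative four-axis scoring sketch. The paper's exact FRP/RCS/CRR/CAR
# formulas are not given in this summary; binary per-case judgments averaged
# per axis are an assumption made here for demonstration.
from dataclasses import dataclass

@dataclass
class CaseJudgment:
    facts_correct: bool       # contributes to factual precision (FRP)
    reasoning_coherent: bool  # contributes to reasoning coherence (RCS)
    compliant: bool           # contributes to compliance reconstruction (CRR)
    abstained: bool           # did the agent abstain on this case?
    should_abstain: bool      # ground truth: was abstention warranted?

def axis_scores(cases: list[CaseJudgment]) -> dict[str, float]:
    """Average each axis independently instead of collapsing to one score."""
    n = len(cases)
    return {
        "FRP": sum(c.facts_correct for c in cases) / n,
        "RCS": sum(c.reasoning_coherent for c in cases) / n,
        "CRR": sum(c.compliant for c in cases) / n,
        # CAR credits a case only when the abstain decision matches need.
        "CAR": sum(c.abstained == c.should_abstain for c in cases) / n,
    }

# An agent that commits on every case: flawless on three axes,
# but its miscalibrated abstention only surfaces in CAR.
cases = [
    CaseJudgment(True, True, True, abstained=False, should_abstain=False),
    CaseJudgment(True, True, True, abstained=False, should_abstain=True),
]
scores = axis_scores(cases)
```

Here `scores` would report FRP, RCS, and CRR at 1.0 but CAR at 0.5, the kind of failure the paper argues a single aggregate accuracy number conceals.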


