ValueAlpha: Agreement-Gated Stress Testing of LLM-Judged Investment Rationales Before Returns Are Observable

arXiv cs.AI / April 29, 2026


Key Points

  • The paper highlights a “pre-realization evaluation” problem in long-horizon investing where realized returns arrive too late and are too noisy to guide AI-finance development and governance decisions.
  • It argues that unvalidated LLM judges may reward superficial behaviors (verbosity, confidence, rubric mimicry) rather than true financial judgment, motivating a more rigorous protocol.
  • ValueAlpha is introduced as a preregistered, agreement-gated stress-testing method that decides whether LLM-judged investment-rationale claims are publishable, qualified, or invalid (a toy sketch of this gate logic follows these key points).
  • In a controlled capital-allocation prototype (1,000 honest cycles plus adversarial controls), the method passes an overall agreement gate (κ̄w = 0.7168) while blocking several overclaims and identifying failure modes such as per-dimension constraint_awareness collapse and family-dependent rankings.
  • The authors position ValueAlpha as a pre-calibration “metrology” layer for AI-finance evaluation rather than a leaderboard or a measure of genuine investment skill.
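The paper does not ship a reference implementation, so the following is a minimal sketch of what the three-way gate decision could look like. It assumes a single agreement threshold of 0.6 for both the aggregate and per-dimension gates and defines the aggregate as the mean of per-dimension weighted kappas; the threshold, the aggregation rule, and every dimension name except constraint_awareness are illustrative assumptions, not values from the paper.

```python
# Illustrative sketch of an agreement-gated verdict (NOT the paper's code).
# Assumptions: a 0.6 kappa threshold for both gates, an aggregate defined as
# the mean of per-dimension kappas, and the three-way verdict scheme named
# in the paper (publishable / qualified / invalid).

from statistics import mean

KAPPA_GATE = 0.6  # assumed threshold; the paper preregisters its own gates


def gate_verdict(per_dimension_kappa: dict[str, float]) -> str:
    """Map per-dimension weighted-kappa agreement to a reporting verdict."""
    aggregate = mean(per_dimension_kappa.values())
    failing = [d for d, k in per_dimension_kappa.items() if k < KAPPA_GATE]
    if aggregate < KAPPA_GATE:
        return "invalid"  # overall judge agreement too low to report at all
    if failing:
        # Aggregate gate passes, but some rubric dimensions are unreliable.
        return f"qualified (failing dimensions: {', '.join(failing)})"
    return "publishable"


# Example: all dimension names and scores except constraint_awareness's
# reported kappa (0.2022) are invented for illustration.
scores = {
    "evidence_use": 0.85,            # hypothetical
    "risk_reasoning": 0.80,          # hypothetical
    "process_discipline": 0.88,      # hypothetical
    "constraint_awareness": 0.2022,  # reported per-dimension failure
}
print(gate_verdict(scores))
# -> qualified (failing dimensions: constraint_awareness)
```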

Abstract

Long-horizon investment decisions create a pre-realization evaluation problem: realized returns are the eventual arbiter of investment quality, but they arrive too late and are too noisy to guide many model-development and governance decisions. LLM judges offer a tempting substitute for pre-deployment evaluation of AI-finance systems, but unvalidated judges may reward verbosity, confidence, or rubric mimicry rather than financial judgment. This paper introduces ValueAlpha, a preregistered agreement-gated stress-test protocol for deciding when LLM-judged investment-rationale claims are publishable, qualified, or invalid. In a controlled market-state capital-allocation prototype with 1,000 honest decision cycles and 100 preregistered adversarial controls (1,100 trajectories, 5,500 judge calls), ValueAlpha clears the aggregate agreement gate at κ̄w = 0.7168 but prevents several overclaims. Lower-rank systems collapse into a tie-class, one rubric dimension fails the per-dimension gate (constraint_awareness, κ̄w = 0.2022), single-judge rankings are family-dependent, and terse-correct rationales receive a Δ = −2.81 rubric-point penalty relative to honest rationales. A targeted anchor-specificity probe further shows that financial constructs such as constraint awareness are operationally load-bearing. The contribution is therefore not a leaderboard and not a claim to measure true investment skill. ValueAlpha is a pre-calibration metrology layer for AI-finance evaluation: it determines whether a proposed LLM-judge-based investment-rationale claim is stable enough, agreed enough, and uncontaminated enough to be reported at all.
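For concreteness, the κ̄w statistic in the abstract is a mean weighted Cohen's kappa across judges. A minimal sketch of how such a number can be computed, assuming quadratic weighting and simple averaging over all judge pairs (the paper's exact weighting and averaging choices may differ), using scikit-learn; all scores below are invented:

```python
# Minimal sketch: mean pairwise weighted Cohen's kappa across LLM judges.
# Assumptions: quadratic weights and plain pairwise averaging; the paper's
# exact scheme is not specified here. All rubric scores are hypothetical.

from itertools import combinations
from statistics import mean

from sklearn.metrics import cohen_kappa_score

# Hypothetical rubric scores (1-5 scale) from three judges on six rationales.
judge_scores = {
    "judge_a": [4, 3, 5, 2, 4, 3],
    "judge_b": [4, 3, 4, 2, 5, 3],
    "judge_c": [3, 3, 5, 2, 4, 2],
}


def mean_weighted_kappa(scores: dict[str, list[int]]) -> float:
    """Average quadratic-weighted Cohen's kappa over all judge pairs."""
    pair_kappas = [
        cohen_kappa_score(scores[a], scores[b], weights="quadratic")
        for a, b in combinations(scores, 2)
    ]
    return mean(pair_kappas)


print(f"kappa_w_bar = {mean_weighted_kappa(judge_scores):.4f}")
```

Quadratic weighting penalizes large ordinal disagreements more heavily than adjacent-score disagreements, which is a natural fit for a 1-5 rubric scale.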