AI Navigate

From Stochastic Answers to Verifiable Reasoning: Interpretable Decision-Making with LLM-Generated Code

arXiv cs.LG / 3/17/2026


Key Points

  • The paper reframes LLMs as code generators that produce executable, human-readable decision logic to run deterministically over structured data, addressing interpretability and reproducibility in high-stakes decisions.
  • It couples code generation with automated statistical validation (precision lift, binomial significance testing, and coverage filtering) and cluster-based gap analysis to iteratively refine rules without human annotation.
  • The framework is demonstrated on venture capital founder screening (VCBench with 4,500 founders and a 9% base rate), achieving 37.5% precision and an F0.5 score of 25.0%, outperforming GPT-4o on precision while maintaining full interpretability.
  • Each prediction traces to executable, human-readable rules, enabling verifiable and auditable LLM-based decision-making in practice.
  • By eliminating per-sample LLM queries and enabling reproducible predictions, the approach aims to scale interpretable AI for high-stakes tasks.
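To make the core idea concrete, here is a minimal sketch of the kind of executable, human-readable rule set the approach would generate once from a single LLM call. All attribute names and thresholds are illustrative assumptions, not fields from VCBench or rules from the paper:

```python
# Hypothetical rule set of the kind an LLM could emit as code.
# Each clause is a plain, auditable condition over structured founder
# attributes; the function runs deterministically with no LLM call.
def predict_success(founder: dict) -> bool:
    # Rule 1 (assumed): experienced founders with at least one prior exit.
    if founder.get("prior_exits", 0) >= 1 and founder.get("years_experience", 0) >= 8:
        return True
    # Rule 2 (assumed): repeat founders with a top-tier education.
    if founder.get("top_tier_education") and founder.get("repeat_founder"):
        return True
    return False
```

Because the logic is ordinary code, every positive prediction traces back to exactly one rule, which is what makes the decisions auditable and reproducible across runs.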

Abstract

Large language models (LLMs) are increasingly used for high-stakes decision-making, yet existing approaches struggle to reconcile scalability, interpretability, and reproducibility. Black-box models obscure their reasoning, while recent LLM-based rule systems rely on per-sample evaluation, causing costs to scale with dataset size and introducing stochastic, hallucination-prone outputs. We propose reframing LLMs as code generators rather than per-instance evaluators. A single LLM call generates executable, human-readable decision logic that runs deterministically over structured data, eliminating per-sample LLM queries while enabling reproducible and auditable predictions. We combine code generation with automated statistical validation using precision lift, binomial significance testing, and coverage filtering, and apply cluster-based gap analysis to iteratively refine decision logic without human annotation. We instantiate this framework in venture capital founder screening, a rare-event prediction task with strong interpretability requirements. On VCBench, a benchmark of 4,500 founders with a 9% base success rate, our approach achieves 37.5% precision and an F0.5 score of 25.0%, outperforming GPT-4o (at 30.0% precision and an F0.5 score of 25.7%) while maintaining full interpretability. Each prediction traces to executable rules over human-readable attributes, demonstrating verifiable and interpretable LLM-based decision-making in practice.
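The automated validation stage the abstract describes can be sketched as follows. This is a minimal illustration under stated assumptions: the thresholds (lift, significance level, coverage floor) and the exact acceptance criterion are guesses, not the paper's values, and the one-sided binomial test is implemented directly with the standard library:

```python
from math import comb

def precision_lift(rule_precision: float, base_rate: float) -> float:
    # How many times better the rule's precision is than random selection.
    return rule_precision / base_rate

def binomial_p_value(successes: int, n: int, base_rate: float) -> float:
    # One-sided tail probability P(X >= successes) for X ~ Binomial(n, base_rate):
    # how likely the observed hit count is if the rule were no better than chance.
    return sum(comb(n, k) * base_rate**k * (1 - base_rate)**(n - k)
               for k in range(successes, n + 1))

def keep_rule(successes: int, covered: int, total: int, base_rate: float,
              min_lift: float = 2.0, alpha: float = 0.05,
              min_coverage: float = 0.01) -> bool:
    # Accept a rule only if it clears all three filters (thresholds are
    # illustrative assumptions, not values from the paper).
    precision = successes / covered
    return (precision_lift(precision, base_rate) >= min_lift
            and binomial_p_value(successes, covered, base_rate) <= alpha
            and covered / total >= min_coverage)
```

For example, a rule matching 48 of 4,500 founders with 18 true successes has 37.5% precision against a 9% base rate (a lift of about 4.2) and a negligible binomial p-value, so it would pass; a rule with only 5 hits on the same 48 founders would fail the lift filter and be discarded.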