Utility-Aware Data Pricing: Token-Level Quality and Empirical Training Gain for LLMs

arXiv cs.LG · April 28, 2026


Key Points

  • The paper argues that common data valuation approaches (e.g., row-count or token-count times a fixed quality coefficient) miss nonlinear, utility-relevant ways that data improves LLM performance.
  • It proposes a utility-aware, dynamic data pricing framework using three layers: token-level information density/quality metrics, empirical training-gain estimation (via influence functions, proxy models, and Data Shapley values), and cryptographic verifiability (hash commitments, Merkle trees, and a tamper-evident training ledger).
  • Experiments across three domains—instruction following, mathematical reasoning, and code summarization—show that proxy-based measures of empirical gain closely match realized utility and substantially outperform simple row-count and token-count baselines.
  • The authors position the framework as enabling a fairer “Data-as-a-Service” market by pricing data according to its actual contribution to model intelligence, while adding transparency and auditability for data buyers and sellers.
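The first layer above scores data by token-level information density. The paper's exact Data Quality Score is not reproduced here, but a minimal sketch of the entropy component is the Shannon entropy of a sample's unigram token distribution: repetitive, low-diversity text scores near zero, while varied text scores higher. The function name and the whitespace tokenizer are illustrative assumptions, not the paper's implementation.

```python
import math
from collections import Counter

def token_entropy(tokens: list[str]) -> float:
    """Shannon entropy (bits/token) of a sequence's unigram distribution.

    Hypothetical proxy for token-level information density: a sample of
    identical tokens scores 0.0; a sample of all-distinct tokens scores
    log2(len(tokens)).
    """
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Naive whitespace tokenization, purely for illustration.
dense = "prove the lemma by induction on the depth of the tree".split()
sparse = ("yes " * 10).split()
assert token_entropy(dense) > token_entropy(sparse)
```

In a real pricing pipeline this unigram measure would be one input among several; entropy alone rewards diversity, not correctness, which is why the framework pairs it with the empirical training-gain layer.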

Abstract

Traditional data valuation methods based on “row-count × quality coefficient” paradigms fail to capture the nuanced, nonlinear contributions that data makes to Large Language Model (LLM) capabilities. This paper presents a dynamic data valuation framework that transitions from static accounting to utility-based pricing. Our approach operates on three layers: (1) token-level information density metrics using Shannon entropy and Data Quality Scores; (2) empirical training gain measurement through influence functions, proxy model strategies, and Data Shapley values; and (3) cryptographic verifiability through hash-based commitments, Merkle trees, and a tamper-evident training ledger. We provide comprehensive experimental validation on three real domains (instruction following, mathematical reasoning, and code summarization), demonstrating that proxy-based empirical gain achieves near-perfect ranking alignment with realized utility, substantially outperforming row-count and token-count baselines. This framework enables a fair Data-as-a-Service economy where high-reasoning data is priced according to its actual contribution to model intelligence, while providing the transparency and auditability necessary for trustworthy data markets.
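The third layer commits the seller to the exact training data via hash-based commitments rolled into a Merkle tree. A minimal sketch, assuming SHA-256 leaves with a simple leaf/node domain-separation byte (the paper's concrete ledger format is not specified here): the seller publishes only the root, and any later alteration of a record changes that commitment.

```python
import hashlib

def _h(data: bytes) -> bytes:
    """SHA-256 digest helper."""
    return hashlib.sha256(data).digest()

def merkle_root(records: list[bytes]) -> str:
    """Hex Merkle root over a list of data records.

    Sketch of the commitment layer: leaves are hashed with a 0x00 prefix and
    internal nodes with 0x01 (an assumed domain separation, not the paper's
    spec). An odd node at any level is paired with a duplicate of itself.
    """
    level = [_h(b"\x00" + r) for r in records]
    while len(level) > 1:
        if len(level) % 2:                      # odd count: duplicate last node
            level.append(level[-1])
        level = [_h(b"\x01" + level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0].hex()

original = merkle_root([b"sample 1", b"sample 2", b"sample 3"])
tampered = merkle_root([b"sample 1", b"sample 2 (edited)", b"sample 3"])
assert original != tampered  # any edit is tamper-evident via the root
```

Appending each published root to an append-only log yields the tamper-evident training ledger: buyers can audit that the data used in training matches what was priced, without the seller revealing the full dataset up front.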