Towards a Data-Parameter Correspondence for LLMs: A Preliminary Discussion

arXiv cs.LG / 4/21/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper proposes a unified “data-parameter correspondence” that frames common LLM optimization methods as dual views of the same geometry on a statistical manifold.
  • It grounds the correspondence in the Fisher-Rao metric and the Legendre duality between natural and expectation parameters, relating data pruning, augmentation, and poisoning to changes in model parameter space (see the sketch after this list).
  • Three key correspondences are presented: geometric equivalence between data pruning and parameter sparsification, low-rank equivalence linking ICL and LoRA in shared subspaces, and a security/privacy link describing how poisoning/backdoors and protective compression interact.
  • By extending the analysis across training, post-training compression, and inference, the work aims to enable methodology transfer between previously separate data-centric and model-centric communities.
  • The authors argue that cooperative optimization combining data and parameter modalities could yield better efficiency, robustness, and privacy than treating them in isolation.
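
The Fisher-Rao and Legendre machinery in the second bullet can be made concrete with a toy exponential family. The sketch below is my own illustration, not code from the paper: for a Bernoulli model it computes the natural parameter $\theta$, the expectation parameter $\eta = A'(\theta)$, and checks that the Fisher metrics in the two coordinate systems are mutual inverses, which is the kind of duality that lets a constraint on the data (expectation) side be rewritten as a dual constraint on the parameter (natural) side.

```python
# Minimal numerical sketch (not from the paper) of the Legendre duality
# between natural parameters (theta) and expectation parameters (eta) for a
# Bernoulli exponential family, with the Fisher-Rao metric obtained as the
# second derivative of the log-partition function.

import numpy as np

def log_partition(theta):
    """A(theta) = log(1 + exp(theta)) for the Bernoulli family."""
    return np.logaddexp(0.0, theta)

def expectation_param(theta):
    """eta = A'(theta) = sigmoid(theta): the mean (success probability)."""
    return 1.0 / (1.0 + np.exp(-theta))

def fisher_info_natural(theta):
    """g(theta) = A''(theta) = eta * (1 - eta): Fisher metric in natural coords."""
    eta = expectation_param(theta)
    return eta * (1.0 - eta)

def fisher_info_expectation(eta):
    """Dual metric in expectation coords: g*(eta) = 1 / (eta * (1 - eta))."""
    return 1.0 / (eta * (1.0 - eta))

theta = 0.7
eta = expectation_param(theta)

# Legendre duality: the two metrics are mutual inverses, g(theta) * g*(eta) = 1,
# so a constraint expressed in data (eta) coordinates has a dual expression in
# parameter (theta) coordinates.
print("eta             :", eta)
print("g(theta)        :", fisher_info_natural(theta))
print("g*(eta)         :", fisher_info_expectation(eta))
print("product (approx 1):", fisher_info_natural(theta) * fisher_info_expectation(eta))

# Sanity check: a finite-difference derivative of A(theta) recovers eta.
h = 1e-5
print("A'(theta) approx eta:", (log_partition(theta + h) - log_partition(theta - h)) / (2 * h))
```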

Abstract

Large language model optimization has historically bifurcated into isolated data-centric and model-centric paradigms: the former manipulates the samples involved through selection, augmentation, or poisoning, while the latter tunes model weights via masking, quantization, or low-rank adaptation. This paper establishes a unified "data-parameter correspondence" revealing these seemingly disparate operations as dual manifestations of the same geometric structure on the statistical manifold $\mathcal{M}$. Grounded in the Fisher-Rao metric $g_{ij}(\theta)$ and the Legendre duality between natural ($\theta$) and expectation ($\eta$) parameters, we identify three fundamental correspondences spanning the model lifecycle:

1. Geometric correspondence: data pruning and parameter sparsification equivalently reduce manifold volume via dual coordinate constraints.
2. Low-rank correspondence: in-context learning (ICL) and LoRA adaptation explore identical subspaces on the Grassmannian $\mathcal{G}(r, d)$, with $k$-shot samples geometrically equivalent to rank-$r$ updates.
3. Security-privacy correspondence: adversarial attacks exhibit cooperative amplification between data poisoning and parameter backdoors, whereas protective mechanisms follow cascading attenuation, in which data compression multiplicatively enhances parameter privacy.

Extending from training through post-training compression to inference, this framework provides a mathematical formalization for cross-community methodology transfer, demonstrating that cooperative optimization integrating data and parameter modalities may outperform isolated approaches across efficiency, robustness, and privacy dimensions.
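
As a rough illustration of the low-rank correspondence (my own sketch with assumed stand-ins, not the authors' code), one way to test whether an ICL-induced update and a LoRA update occupy the same point of the Grassmannian $\mathcal{G}(r, d)$ is to compare the principal angles between their column subspaces: nearly identical subspaces give angles near zero, while unrelated subspaces give a large geodesic distance.

```python
# Sketch (not the paper's code) of comparing two rank-r update subspaces as
# points on the Grassmannian G(r, d) via principal angles.

import numpy as np

def orthonormal_basis(M):
    """Orthonormal basis for the column space of a (d x r) matrix M."""
    q, _ = np.linalg.qr(M)
    return q[:, :M.shape[1]]

def principal_angles(A, B):
    """Principal angles (radians) between the column spaces of A and B."""
    qa, qb = orthonormal_basis(A), orthonormal_basis(B)
    # Singular values of Qa^T Qb are the cosines of the principal angles.
    cosines = np.linalg.svd(qa.T @ qb, compute_uv=False)
    return np.arccos(np.clip(cosines, -1.0, 1.0))

def grassmann_distance(A, B):
    """Geodesic distance on the Grassmannian: L2 norm of the principal angles."""
    return np.linalg.norm(principal_angles(A, B))

rng = np.random.default_rng(0)
d, r = 64, 4

# Hypothetical stand-ins: a rank-r LoRA-style update W = B @ A, and an "ICL"
# direction lying in (almost) the same subspace plus a small perturbation.
lora_B, lora_A = rng.normal(size=(d, r)), rng.normal(size=(r, d))
lora_update = lora_B @ lora_A          # d x d, rank r; its column space equals that of lora_B
lora_subspace = lora_B

icl_subspace = lora_B @ rng.normal(size=(r, r)) + 0.01 * rng.normal(size=(d, r))

print("distance to a nearby subspace :", grassmann_distance(lora_subspace, icl_subspace))
print("distance to a random subspace :", grassmann_distance(lora_subspace, rng.normal(size=(d, r))))
```

The stand-in matrices here are random; in an actual experiment one would extract the LoRA factors and the ICL-induced activation or weight shift from a real model and compare those subspaces instead.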