Towards a Data-Parameter Correspondence for LLMs: A Preliminary Discussion

arXiv cs.LG / 4/21/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper proposes a unified “data-parameter correspondence” that frames common LLM optimization methods as dual views of the same geometry on a statistical manifold.
  • It grounds the correspondence in the Fisher-Rao metric and the Legendre duality between natural and expectation parameters, relating data pruning, augmentation, and poisoning to changes in model parameter space (see the sketch after this list).
  • Three key correspondences are presented: geometric equivalence between data pruning and parameter sparsification, low-rank equivalence linking ICL and LoRA in shared subspaces, and a security/privacy link describing how poisoning/backdoors and protective compression interact.
  • By extending the analysis across training, post-training compression, and inference, the work aims to enable methodology transfer between previously separate data-centric and model-centric communities.
  • The authors argue that cooperative optimization combining data and parameter modalities could yield better efficiency, robustness, and privacy than treating them in isolation.
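
The Fisher-Rao and Legendre machinery in the second bullet can be made concrete with a toy exponential family. The sketch below is my own illustration, not code from the paper: for a Bernoulli model it computes the natural parameter $\theta$, the expectation parameter $\eta = A'(\theta)$, and checks that the Fisher metrics in the two coordinate systems are mutual inverses, which is the kind of duality that lets a constraint on the data (expectation) side be rewritten as a dual constraint on the parameter (natural) side.

```python
# Minimal numerical sketch (not from the paper) of the Legendre duality
# between natural parameters (theta) and expectation parameters (eta) for a
# Bernoulli exponential family, with the Fisher-Rao metric obtained as the
# second derivative of the log-partition function.

import numpy as np

def log_partition(theta):
    """A(theta) = log(1 + exp(theta)) for the Bernoulli family."""
    return np.logaddexp(0.0, theta)

def expectation_param(theta):
    """eta = A'(theta) = sigmoid(theta): the mean (success probability)."""
    return 1.0 / (1.0 + np.exp(-theta))

def fisher_info_natural(theta):
    """g(theta) = A''(theta) = eta * (1 - eta): Fisher metric in natural coords."""
    eta = expectation_param(theta)
    return eta * (1.0 - eta)

def fisher_info_expectation(eta):
    """Dual metric in expectation coords: g*(eta) = 1 / (eta * (1 - eta))."""
    return 1.0 / (eta * (1.0 - eta))

theta = 0.7
eta = expectation_param(theta)

# Legendre duality: the two metrics are mutual inverses, g(theta) * g*(eta) = 1,
# so a constraint expressed in data (eta) coordinates has a dual expression in
# parameter (theta) coordinates.
print("eta             :", eta)
print("g(theta)        :", fisher_info_natural(theta))
print("g*(eta)         :", fisher_info_expectation(eta))
print("product (approx 1):", fisher_info_natural(theta) * fisher_info_expectation(eta))

# Sanity check: a finite-difference derivative of A(theta) recovers eta.
h = 1e-5
print("A'(theta) approx eta:", (log_partition(theta + h) - log_partition(theta - h)) / (2 * h))
```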

Abstract

Large language model optimization has historically bifurcated into isolated data-centric and model-centric paradigms: the former manipulates the samples involved through selection, augmentation, or poisoning, while the latter tunes model weights via masking, quantization, or low-rank adaptation. This paper establishes a unified "data-parameter correspondence" revealing these seemingly disparate operations as dual manifestations of the same geometric structure on the statistical manifold $\mathcal{M}$. Grounded in the Fisher-Rao metric $g_{ij}(\theta)$ and the Legendre duality between natural ($\theta$) and expectation ($\eta$) parameters, we identify three fundamental correspondences spanning the model lifecycle:

1. Geometric correspondence: data pruning and parameter sparsification equivalently reduce manifold volume via dual coordinate constraints.
2. Low-rank correspondence: in-context learning (ICL) and LoRA adaptation explore identical subspaces on the Grassmannian $\mathcal{G}(r, d)$, with $k$-shot samples geometrically equivalent to rank-$r$ updates.
3. Security-privacy correspondence: adversarial attacks exhibit cooperative amplification between data poisoning and parameter backdoors, whereas protective mechanisms follow cascading attenuation, in which data compression multiplicatively enhances parameter privacy.

Extending from training through post-training compression to inference, this framework provides a mathematical formalization for cross-community methodology transfer, demonstrating that cooperative optimization integrating data and parameter modalities may outperform isolated approaches across efficiency, robustness, and privacy dimensions.
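
As a rough illustration of the low-rank correspondence (my own sketch with assumed stand-ins, not the authors' code), one way to test whether an ICL-induced update and a LoRA update occupy the same point of the Grassmannian $\mathcal{G}(r, d)$ is to compare the principal angles between their column subspaces: nearly identical subspaces give angles near zero, while unrelated subspaces give a large geodesic distance.

```python
# Sketch (not the paper's code) of comparing two rank-r update subspaces as
# points on the Grassmannian G(r, d) via principal angles.

import numpy as np

def orthonormal_basis(M):
    """Orthonormal basis for the column space of a (d x r) matrix M."""
    q, _ = np.linalg.qr(M)
    return q[:, :M.shape[1]]

def principal_angles(A, B):
    """Principal angles (radians) between the column spaces of A and B."""
    qa, qb = orthonormal_basis(A), orthonormal_basis(B)
    # Singular values of Qa^T Qb are the cosines of the principal angles.
    cosines = np.linalg.svd(qa.T @ qb, compute_uv=False)
    return np.arccos(np.clip(cosines, -1.0, 1.0))

def grassmann_distance(A, B):
    """Geodesic distance on the Grassmannian: L2 norm of the principal angles."""
    return np.linalg.norm(principal_angles(A, B))

rng = np.random.default_rng(0)
d, r = 64, 4

# Hypothetical stand-ins: a rank-r LoRA-style update W = B @ A, and an "ICL"
# direction lying in (almost) the same subspace plus a small perturbation.
lora_B, lora_A = rng.normal(size=(d, r)), rng.normal(size=(r, d))
lora_update = lora_B @ lora_A          # d x d, rank r; its column space equals that of lora_B
lora_subspace = lora_B

icl_subspace = lora_B @ rng.normal(size=(r, r)) + 0.01 * rng.normal(size=(d, r))

print("distance to a nearby subspace :", grassmann_distance(lora_subspace, icl_subspace))
print("distance to a random subspace :", grassmann_distance(lora_subspace, rng.normal(size=(d, r))))
```

The stand-in matrices here are random; in an actual experiment one would extract the LoRA factors and the ICL-induced activation or weight shift from a real model and compare those subspaces instead.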