Is Compression Really Linear with Code Intelligence?

arXiv cs.CL / 3/27/2026


Key Points

  • The paper investigates how code-focused compression metrics relate to the capabilities of Large Language Models (LLMs), especially for code intelligence across varied languages and tasks.
  • It argues that the linear relationship assumed by prior studies was incomplete, stemming from limited and unfair evaluation of modern Code LLMs and insufficient coverage of real-world code diversity.
  • To improve evaluation, the authors propose a lightweight “Format Annealing” method and run experiments on a broad set of open-source Code LLMs using multi-language, multi-task benchmarks.
  • Using bits-per-character (BPC) measured on a newly created large GitHub-derived validation set, the study finds an intrinsic logarithmic (not linear) relationship between measured code intelligence and compression.
  • The authors interpret earlier linear-looking findings as possibly arising from the logarithmic curve’s tail under restricted experimental conditions, and they present a more robust framework for code-domain model assessment.
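The compression metric the study relies on, bits-per-character (BPC), is derived from a model's negative log-likelihood over a held-out code corpus. As a minimal sketch (the helper name and the toy numbers below are illustrative, not from the paper), the conversion from summed NLL in nats to BPC looks like this:

```python
import math

def bits_per_character(total_nll_nats: float, num_chars: int) -> float:
    """Convert a model's summed negative log-likelihood (in nats)
    over a corpus into bits-per-character (BPC).

    Lower BPC means the model compresses the validation code better.
    """
    total_bits = total_nll_nats / math.log(2)  # nats -> bits
    return total_bits / num_chars

# Toy example: 5000 nats of total NLL over a 2000-character file.
bpc = bits_per_character(5000.0, 2000)
```

Note that BPC is character-based rather than token-based, which makes it comparable across models with different tokenizers.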

Abstract

Understanding the relationship between data compression and the capabilities of Large Language Models (LLMs) is crucial, especially in specialized domains like code intelligence. Prior work posited a linear relationship between compression and general intelligence. However, it overlooked the multifaceted nature of code that encompasses diverse programming languages and tasks, and struggled with fair evaluation of modern Code LLMs. We address this by evaluating a diverse array of open-source Code LLMs on comprehensive multi-language, multi-task code benchmarks. To address the challenge of efficient and fair evaluation of pre-trained LLMs' code intelligence, we introduce “Format Annealing”, a lightweight, transparent training methodology designed to assess the intrinsic capabilities of these pre-trained models equitably. Compression efficacy, measured as bits-per-character (BPC), is determined using a novel, large-scale, and previously unseen code validation set derived from GitHub. Our empirical results reveal a fundamental logarithmic relationship between measured code intelligence and BPC. This finding refines prior hypotheses of linearity, which we suggest are likely observations of the logarithmic curve's tail under specific, limited conditions. Our work provides a more nuanced understanding of compression's role in developing code intelligence and contributes a robust evaluation framework in the code domain.
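The claimed logarithmic relationship can be checked with a simple least-squares fit of benchmark score against ln(BPC). The data points below are hypothetical placeholders for illustration only (the paper's actual model scores and BPC values would go here); the fit procedure itself is standard:

```python
import numpy as np

# Hypothetical (BPC, benchmark score) pairs, illustrative only.
# A real analysis would use the paper's measured values per model.
bpc = np.array([0.55, 0.60, 0.70, 0.85, 1.00])
score = np.array([72.0, 68.0, 61.0, 52.0, 45.0])

# Fit score = a * ln(BPC) + b via least squares on ln(BPC).
a, b = np.polyfit(np.log(bpc), score, deg=1)
predicted = a * np.log(bpc) + b
```

Under the paper's finding, the slope `a` should be negative (lower BPC, i.e. better compression, corresponds to higher measured code intelligence), and a log fit should explain the data better than a straight line over a wide BPC range; a linear fit looks adequate only on a narrow tail of the curve.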