Tracing the complexity profiles of different linguistic phenomena through the intrinsic dimension of LLM representations

arXiv cs.CL / April 27, 2026


Key Points

  • The paper investigates the intrinsic dimension (ID) of LLM internal representations as a quantitative marker of linguistic complexity.
  • It tests whether layer-wise ID differences correspond to established (psycho)linguistic complexity contrasts such as coordination vs. subordination, right-branching vs. center-embedding, and unambiguous vs. ambiguous attachment.
  • Experiments across six different LLMs find that more complex linguistic phenomena consistently produce higher ID profiles.
  • ID differences for different linguistic contrasts emerge, and reach their peaks, at different layers, suggesting distinct processing stages.
  • Additional analyses using representational similarity and layer pruning reinforce the same trends and support ID as a way to distinguish types of complexity.
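The summary does not say which ID estimator the paper uses, but a standard choice for analyses of this kind is the TwoNN estimator (Facco et al.), which infers dimension from the ratio of each point's two nearest-neighbor distances. As a hedged illustration only (the function name and setup are ours, not the paper's), here is a minimal NumPy sketch that could be applied to a matrix of hidden states from one model layer, with one row per token or sentence representation:

```python
import numpy as np

def two_nn_id(X: np.ndarray) -> float:
    """Estimate intrinsic dimension with the TwoNN estimator.

    For each point, take the ratio mu = r2 / r1 of the distances to its
    second and first nearest neighbors; the maximum-likelihood estimate
    of the dimension is then N / sum(log(mu)).
    """
    # Pairwise Euclidean distances between all rows of X.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)        # exclude self-distances
    r = np.sort(d, axis=1)[:, :2]      # r1, r2 for every point
    mu = r[:, 1] / r[:, 0]
    mu = mu[mu > 1.0]                  # drop degenerate ties (r1 == r2)
    return len(mu) / np.log(mu).sum()

# Sanity check on synthetic data: a 2D Gaussian cloud linearly embedded
# in 10 ambient dimensions should yield an ID estimate close to 2.
rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 2))          # latent 2D manifold
X = Z @ rng.normal(size=(2, 10))       # embed into 10D
print(round(two_nn_id(X), 2))
```

In a layer-wise analysis like the paper's, this estimator would be run once per layer on the representations of each stimulus set (e.g., center-embedded vs. right-branching sentences), producing the per-layer ID profiles that are then compared across conditions.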

Abstract

We explore intrinsic dimension (ID) of LLM representations as a marker of linguistic complexity. Specifically, we test whether ID differences across model layers reflect well-known complexity contrasts established in (psycho)linguistics: coordination vs. subordination, right-branching vs. center-embedding, and unambiguous vs. ambiguous attachment. Our results on six different LLMs show that these contrasts are consistently reflected in ID differences, with more complex phenomena eliciting higher ID profiles. Notably, ID differences emerge at different points across layers for different contrasts, also reaching their peaks at different stages. Further experiments using representational similarity and layer pruning confirm the trends. We conclude that ID is a useful marker of linguistic complexity in LLMs, that it points to similar linguistic processing steps across disparate LLMs, and that it has the potential to differentiate between different types of complexity.