RELIC: Evaluating Complex Reasoning via the Recognition of Languages In-Context

arXiv cs.CL · April 29, 2026


Key Points

  • The paper introduces RELIC, a scalable evaluation framework that tests whether an LLM can recognize membership in a context-free language defined by an in-context grammar.
  • By varying grammar size and input string length, RELIC controls task complexity and maps that complexity to expected “ideal” LLM performance.
  • Experiments show that even advanced reasoning models struggle on RELIC, failing to increase inference compute as difficulty rises.
  • The study finds that these reductions in compute accompany a shift in reasoning strategy: models move from identifying and implementing algorithmic solutions toward guessing, which on hard tasks manifests as “quiet quitting” when full completions go uninspected.
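The task RELIC poses to an LLM has a classical exact solution, which is what makes the benchmark verifiable: membership of a string in a context-free language can be decided with the CYK algorithm. The sketch below is illustrative only, using a small hypothetical grammar in Chomsky normal form (not a grammar from the paper):

```python
# Decide membership of a string in the context-free language generated by a
# grammar, the task RELIC presents to an LLM in-context. Solved exactly here
# with the CYK dynamic-programming algorithm. The grammar is a hypothetical
# example in Chomsky normal form: rules are (LHS, RHS), where RHS is either
# a terminal character or a pair of non-terminals.
GRAMMAR = {
    ("S", ("A", "B")), ("S", ("B", "C")),
    ("A", ("B", "A")), ("A", "a"),
    ("B", ("C", "C")), ("B", "b"),
    ("C", ("A", "B")), ("C", "a"),
}

def cyk_recognize(grammar, string, start="S"):
    """Return True iff `string` is derivable from `start` under `grammar`."""
    n = len(string)
    if n == 0:
        return False  # the CNF grammars used here do not derive the empty string
    # table[j][i] = set of non-terminals deriving the substring of length j+1
    # starting at position i
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(string):
        table[0][i] = {lhs for lhs, rhs in grammar if rhs == ch}
    for span in range(2, n + 1):            # substring length
        for i in range(n - span + 1):       # start position
            for split in range(1, span):    # where to cut the substring in two
                left = table[split - 1][i]
                right = table[span - split - 1][i + split]
                for lhs, rhs in grammar:
                    if isinstance(rhs, tuple) and rhs[0] in left and rhs[1] in right:
                        table[span - 1][i].add(lhs)
    return start in table[n - 1][0]

print(cyk_recognize(GRAMMAR, "baaba"))  # → True
print(cyk_recognize(GRAMMAR, "bb"))     # → False
```

CYK runs in O(|G| · n³) time for grammar size |G| and string length n, which is why varying those two knobs gives a clean prediction of how much compute an "ideal" recognizer should spend as instances get harder.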

Abstract

Large language models (LLMs) are increasingly used to solve complex tasks where they must retrieve and compose many pieces of in-context information in long reasoning chains. For many real-world tasks it is hard to accurately gauge how model performance and strategy change as task complexity grows. To evaluate models' complex reasoning capability in a scalable and verifiable way, we introduce RELIC (Recognition of Languages In-Context), a framework that evaluates an LLM's ability to decide whether a given string belongs to the context-free language (CFL) generated by a grammar presented in-context. CFL recognition allows us to modulate the intrinsic complexity of the problem by varying grammar size and string length and translate this asymptotic complexity into predictions for ideal LLM performance. We find that even the most advanced reasoning models perform poorly on RELIC, not only failing to appropriately scale their inference compute to keep pace with task difficulty, but even reducing the number of reasoning tokens they use as task complexity increases. We find that these decreases in compute accompany changes in reasoning strategy, as models move from identifying and implementing algorithmic solutions to guessing. For models whose full completions go uninspected, this manifests as “quiet quitting” on hard tasks.
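The abstract's two complexity knobs, grammar size and string length, suggest how test instances could be generated at scale. The sketch below is a guess at such a sampler, not the paper's actual procedure; the rule-shape split and parameter names are assumptions:

```python
import random

def random_cnf_grammar(num_nonterminals, num_rules, terminals="ab", seed=0):
    """Sample a hypothetical CNF grammar whose size is an explicit knob.

    Illustrative only: RELIC's actual grammar-sampling procedure may differ.
    Each rule is (LHS, RHS), with RHS either a terminal character or a pair
    of non-terminals, mirroring Chomsky normal form.
    """
    rng = random.Random(seed)
    nonterminals = [f"N{i}" for i in range(num_nonterminals)]
    rules = set()
    while len(rules) < num_rules:
        lhs = rng.choice(nonterminals)
        if rng.random() < 0.5:  # terminal rule, e.g. N3 -> a
            rules.add((lhs, rng.choice(terminals)))
        else:                   # binary rule, e.g. N3 -> N1 N4
            rules.add((lhs, (rng.choice(nonterminals), rng.choice(nonterminals))))
    return rules

# Grammar size (num_rules) and, separately, the length of candidate strings
# would then be swept to produce instances of graded intrinsic complexity.
g = random_cnf_grammar(num_nonterminals=5, num_rules=12)
print(len(g))  # 12 rules
```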