Diagnosing CFG Interpretation in LLMs

arXiv cs.AI / 4/23/2026


Key Points

  • The paper evaluates whether LLMs can act as in-context interpreters for newly provided context-free grammars, producing outputs that are syntactically valid, behaviorally functional, and semantically faithful.
  • It introduces RoboGrid, a framework that separates syntax, behavior, and semantics using stress tests across recursion depth, expression complexity, and surface-style variations.
  • Experiments show a hierarchical degradation pattern where LLMs may keep surface-level syntax but increasingly fail to preserve structural semantics, especially under deep recursion and high branching.
  • Chain-of-Thought (CoT) helps only partially; at high structural density and extreme depths, semantic alignment collapses.
  • Using “Alien” lexicons, the study finds that models depend on semantic bootstrapping from familiar keywords rather than purely symbolic induction, highlighting gaps in the hierarchical state tracking required for grammar-agnostic agents.

Abstract

As LLMs are increasingly integrated into agentic systems, they must adhere to dynamically defined, machine-interpretable interfaces. We evaluate LLMs as in-context interpreters: given a novel context-free grammar, can LLMs generate syntactically valid, behaviorally functional, and semantically faithful outputs? We introduce RoboGrid, a framework that disentangles syntax, behavior, and semantics through controlled stress-tests of recursion depth, expression complexity, and surface styles. Our experiments reveal a consistent hierarchical degradation: LLMs often maintain surface syntax but fail to preserve structural semantics. Despite the partial mitigation provided by CoT reasoning, performance collapses under structural density, specifically deep recursion and high branching, with semantic alignment vanishing at extreme depths. Furthermore, "Alien" lexicons reveal that LLMs rely on semantic bootstrapping from keywords rather than pure symbolic induction. These findings pinpoint critical gaps in hierarchical state-tracking required for reliable, grammar-agnostic agents.
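To make the paper's syntax/semantics distinction concrete, here is a minimal sketch (not the paper's RoboGrid code; the grammar, tokens, and function names are hypothetical) of the two checks an evaluator would run on a model's output: a recursive-descent parse for syntactic validity, and a small interpreter that computes the grid position the program denotes, so two syntactically different programs can be compared for semantic equivalence. Nested `repeat` blocks play the role of the paper's recursion-depth stress axis.

```python
# Hypothetical toy grammar, loosely in the spirit of RoboGrid:
#   prog ::= stmt+
#   stmt ::= "move" dir | "repeat" INT "{" prog "}"
#   dir  ::= "up" | "down" | "left" | "right"

DIRS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def tokenize(src):
    # Split on whitespace, treating braces as standalone tokens.
    return src.replace("{", " { ").replace("}", " } ").split()

def parse(tokens, i=0):
    """Recursive-descent parse of `prog`; returns (ast, next_index)."""
    stmts = []
    while i < len(tokens) and tokens[i] != "}":
        if tokens[i] == "move":
            if i + 1 >= len(tokens) or tokens[i + 1] not in DIRS:
                raise SyntaxError("expected a direction after 'move'")
            stmts.append(("move", tokens[i + 1]))
            i += 2
        elif tokens[i] == "repeat":
            if i + 2 >= len(tokens) or not tokens[i + 1].isdigit() or tokens[i + 2] != "{":
                raise SyntaxError("expected 'repeat INT {'")
            n = int(tokens[i + 1])
            body, i = parse(tokens, i + 3)   # recurse into the block
            if i >= len(tokens) or tokens[i] != "}":
                raise SyntaxError("missing '}'")
            stmts.append(("repeat", n, body))
            i += 1
        else:
            raise SyntaxError(f"unexpected token {tokens[i]!r}")
    return stmts, i

def is_valid(src):
    """Syntax check: does the whole string parse as a non-empty `prog`?"""
    try:
        tokens = tokenize(src)
        ast, i = parse(tokens)
        return i == len(tokens) and bool(ast)
    except SyntaxError:
        return False

def run(ast, pos=(0, 0)):
    """Semantics: the grid position the program denotes, starting at `pos`."""
    x, y = pos
    for stmt in ast:
        if stmt[0] == "move":
            dx, dy = DIRS[stmt[1]]
            x, y = x + dx, y + dy
        else:  # ("repeat", n, body)
            for _ in range(stmt[1]):
                x, y = run(stmt[2], (x, y))
    return x, y
```

With this split, a model output can pass the syntax check yet denote the wrong position, which is the hierarchical failure mode the experiments report: for example, `"move up move up"` and `"repeat 2 { move up }"` are both valid and semantically equivalent (both denote `(0, 2)`), while `"repeat 2 { move up"` is rejected outright.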