Do LLMs Game Formalization? Evaluating Faithfulness in Logical Reasoning
arXiv cs.AI / 4/22/2026
💬 OpinionIdeas & Deep AnalysisModels & Research
Key Points
- The paper studies “formalization gaming,” where LLMs may exploit gaps between producing valid formal proofs and producing faithful formalizations of natural-language logic problems.
- It evaluates GPT-5 and DeepSeek-R1 on 303 first-order logic tasks by comparing a unified generation approach with a two-stage pipeline that separates axiom formalization from proof generation.
- Although the models achieve high Lean 4 compilation rates (87–99%), the study finds no clear evidence of systematic gaming in unified generation, with models often preferring to report failure rather than force proofs.
- Using the two-stage pipeline, the authors still detect distinct unfaithfulness modes: GPT-5 may fabricate axioms during proof generation (detectable via cross-stage comparison), while DeepSeek-R1 may mistranslate premises during formalization in a way that can be internally consistent and evade the proposed detection.
- The results caution that high proof compilation/accuracy should not be treated as proof of faithful logical reasoning, and the associated code and data are released for further analysis.



