A Reality Check of Language Models as Formalizers on Constraint Satisfaction Problems

arXiv cs.CL / 4/1/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper evaluates whether large language models used as “formalizers” (turning problem statements into formal programs for external solvers) reliably improve performance on real-world constraint satisfaction problems.
  • Across 4 benchmarks, 6 LLMs, and 2 types of formal languages, LLM-as-formalizer underperforms LLM-as-solver in 15 of 24 model–dataset combinations, showing that formalization does not trivialize the task despite its greater verifiability and interpretability.
  • Even though the formalization space is orders of magnitude smaller than the end-to-end solution search space, the scaling analysis finds that LLM-as-formalizer performance still degrades sharply as problem complexity increases, mirroring solver-style approaches.
  • The authors identify a key limitation: the models sometimes produce excessive, solver-like reasoning tokens and even hard-code solutions, suggesting failure modes that future formalization methods must address.

Abstract

Recent work reports superior performance when using large language models (LLMs) as formalizers rather than as end-to-end solvers for symbolic reasoning problems. Given a problem description, the LLM generates a formal program from which an external solver derives a solution. We systematically investigate the formalization capability of LLMs on real-life constraint satisfaction problems across 4 benchmarks, 6 LLMs, and 2 types of formal languages. We show that LLM-as-formalizer by no means trivializes the problem: it underperforms LLM-as-solver in 15 out of 24 model-dataset combinations, despite the former's verifiability and interpretability. Although the formalization space is orders of magnitude smaller than the search space, our scaling analysis shows that LLM-as-formalizer still degrades drastically as problem complexity increases, similar to LLM-as-solver. To better understand this limitation, we observe excessive, solver-like reasoning tokens that sometimes lead to hard-coded solutions, highlighting a key challenge for improving LLM-based formalization.
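To make the formalizer-vs-solver distinction concrete, here is a minimal, hypothetical sketch (not from the paper): the "formal program" is represented as a toy CSP specification that an LLM might emit, and the "external solver" is a brute-force search over assignments. The example scheduling problem, the `solve_csp` function, and all names are illustrative assumptions, not the paper's actual benchmarks or formal languages.

```python
# Illustrative sketch of the LLM-as-formalizer paradigm (all details assumed,
# not taken from the paper): a formalizer would turn a natural-language
# problem into a formal CSP spec, and an external solver derives the answer.

from itertools import product

def solve_csp(variables, constraints):
    """Toy 'external solver': brute-force search over all assignments.

    variables:   dict mapping variable name -> list of domain values
    constraints: list of (var_names, predicate) pairs; a predicate receives
                 the candidate values of its variables and returns bool
    """
    names = list(variables)
    for values in product(*(variables[n] for n in names)):
        assignment = dict(zip(names, values))
        if all(pred(*(assignment[v] for v in vs)) for vs, pred in constraints):
            return assignment  # first satisfying assignment
    return None  # unsatisfiable

# Hypothetical formalizer output for the prompt: "Schedule talks A and B in
# slots 1-3 so that A comes before B and B is not in slot 2."
spec = {
    "variables": {"A": [1, 2, 3], "B": [1, 2, 3]},
    "constraints": [
        (("A", "B"), lambda a, b: a < b),   # A before B
        (("B",), lambda b: b != 2),         # B not in slot 2
    ],
}

solution = solve_csp(spec["variables"], spec["constraints"])
print(solution)  # {'A': 1, 'B': 3}
```

The appeal the paper interrogates is visible even in this toy: the spec is small, checkable, and interpretable, while the search is delegated to a verifiable procedure. The paper's finding is that producing a *correct* spec for realistic problems is itself hard for LLMs, and failure modes include hard-coding the solution into the spec rather than encoding the constraints.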