Surface Sensitivity in Lean 4 Autoformalization

arXiv cs.LG / 4/28/2026

💬 OpinionIdeas & Deep AnalysisModels & Research

共有:

Key Points

The paper studies whether paraphrase-induced differences in Lean 4 autoformalization outputs stem from real semantic disagreement or from shallower failures.
Using 60 deterministic paraphrase rules on datasets ProofNet# and miniF2F, the authors test multiple GPT-family models and several open-weight 7B autoformalizers.
They find that when both the original and paraphrased outputs successfully compile, the paired formalizations are semantically equivalent under BEq+ and structurally very similar under GTED.
In contrast, paraphrasing strongly changes whether outputs compile, indicating that the main sensitivity comes from compilation-boundary failures rather than semantic divergence.
The authors recommend future training and benchmark designs to focus on compile-boundary robustness and to explicitly distinguish compile-conditional equivalence from surface consistency.

Abstract

Natural-language variation poses a key challenge in Lean autoformalization: semantically equivalent paraphrases of the same theorem statements can induce divergent formal outputs, yet it remains unclear whether this variation reflects semantic disagreements or shallower failures. We investigate this question in Lean 4 using 60 deterministic paraphrase rules applied to ProofNet\# and miniF2F. Across four GPT-family models and three open-weight 7B autoformalizers, we find that the observed paraphrase sensitivity reflects compilation-boundary failures rather than semantic divergence among successful formalizations. In particular, when both baseline and perturbed outputs compile, paired predictions are semantically equivalent under BEq+ and structurally near-identical under GTED. By contrast, paraphrasing substantially affects whether outputs compile, with failure modes varying across datasets and perturbation classes. Our results suggest that future training-time interventions should target the compile boundary rather than the semantic layer, and that benchmarks should separate compile-conditional equivalence from surface consistency.

Big Tech firms are accelerating AI investments and integration, while regulators and companies focus on safety and responsible adoption.

Dev.to

Same Agent, Different Risk | How Microsoft 365 Copilot Grounding Changes the Security Model | Rahsi Framework™

Dev.to

Claude Haiku for Low-Cost AI Inference: Patterns from a Horse Racing Prediction System

Dev.to

How We Built an Ambient AI Clinical Documentation Pipeline (and Saved Doctors 8+ Hours a Week)

Dev.to

🦀 PicoClaw Deep Dive — A Field Guide to Building an Ultra-Light AI Agent in Go 🐹

Dev.to

Surface Sensitivity in Lean 4 Autoformalization

Key Points

Abstract

Related Articles

Big Tech firms are accelerating AI investments and integration, while regulators and companies focus on safety and responsible adoption.

Same Agent, Different Risk | How Microsoft 365 Copilot Grounding Changes the Security Model | Rahsi Framework™

Claude Haiku for Low-Cost AI Inference: Patterns from a Horse Racing Prediction System

How We Built an Ambient AI Clinical Documentation Pipeline (and Saved Doctors 8+ Hours a Week)

🦀 PicoClaw Deep Dive — A Field Guide to Building an Ultra-Light AI Agent in Go 🐹

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer