LongCat-Flash-Prover: Advancing Native Formal Reasoning via Agentic Tool-Integrated Reinforcement Learning

arXiv cs.AI / 2026/3/24

📰 ニュースSignals & Early TrendsIdeas & Deep AnalysisModels & Research

共有:

要点

The paper introduces LongCat-Flash-Prover, a 560B-parameter open-weights MoE model targeting native formal reasoning in Lean4 using agentic tool-integrated reasoning (TIR).
It breaks formal reasoning into three capabilities—auto-formalization, sketching, and proving—and uses a Hybrid-Experts Iteration Framework to expand high-quality training trajectories from informal problems to formal statements and full proofs.
For agentic reinforcement learning on these long-horizon tasks, the authors propose HisPO (Hierarchical Importance Sampling Policy Optimization) to stabilize MoE training via gradient masking that addresses policy staleness and train–inference discrepancies at sequence and token levels.
To reduce reward hacking, the system includes theorem consistency checks and legality detection mechanisms.
Experiments report new state-of-the-art results for open-weights models on auto-formalization and theorem proving, including a 97.1% pass rate on MiniF2F-Test with only 72 inference budget per problem and strong performance on ProverBench and PutnamBench.

Abstract

We introduce LongCat-Flash-Prover, a flagship 560-billion-parameter open-source Mixture-of- Experts (MoE) model that advances Native Formal Reasoning in Lean4 through agentic tool-integrated reasoning (TIR). We decompose the native formal reasoning task into three independent formal capabilities, i.e., auto-formalization, sketching, and proving. To facilitate these capabilities, we propose a Hybrid-Experts Iteration Framework to expand high-quality task trajectories, including generating a formal statement based on a given informal problem, producing a whole-proof directly from the statement, or a lemma-style sketch. During agentic RL, we present a Hierarchical Importance Sampling Policy Optimization (HisPO) algorithm, which aims to stabilize the MoE model training on such long-horizon tasks. It employs a gradient masking strategy that accounts for the policy staleness and the inherent train-inference engine discrepancies at both sequence and token levels. Additionally, we also incorporate theorem consistency and legality detection mechanisms to eliminate reward hacking issues. Extensive evaluations show that our LongCat-Flash-Prover sets a new state-of-the-art for open-weights models in both auto-formalization and theorem proving. Demonstrating remarkable sample efficiency, it achieves a 97.1% pass rate on MiniF2F-Test using only 72 inference budget per problem. On more challenging benchmarks, it solves 70.8% of ProverBench and 41.5% of PutnamBench with no more than 220 attempts per problem, significantly outperforming existing open-weights baselines.