Formalizing statistical learning theory in Lean 4 [R]

Reddit r/MachineLearning / 5/9/2026


Key Points

  • The project FormalSLT is working to formalize key parts of statistical learning theory in Lean 4, focusing on rigorous, readable theorem development.
  • Current formalized results cover multiple standard learning-theory tools and bounds, including finite-class ERM bounds, Rademacher symmetrization, high-probability Rademacher bounds, VC-dimension connections (Sauer–Shelah), scalar contraction, linear predictor bounds, finite PAC-Bayes bounds, and algorithmic stability.
  • The author’s main design goal is to create a “theorem ladder” with explicit assumptions, scoped theorem statements, and no use of Lean’s placeholder proofs (no `sorry`).
  • Compared with other Lean SLT efforts that emphasize abstract probability and empirical-process infrastructure, this work prioritizes explicit finite-sample PAC/Rademacher/stability proof routes with end-to-end theorem chains aligned to standard SLT presentations.
  • The author is seeking feedback on theorem organization, proof structure, naming/API decisions, and suggestions for useful next targets to formalize.

I’ve been working on a Lean 4 project focused on formalizing parts of statistical learning theory:

FormalSLT repository

Current results include:

  • finite-class ERM bounds
  • Rademacher symmetrization
  • high-probability Rademacher bounds
  • Sauer–Shelah / VC-dimension bridge
  • finite scalar contraction
  • linear predictor bounds
  • finite PAC-Bayes bounds
  • algorithmic stability
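
For orientation, the finite-class ERM bound referred to above is presumably the standard Hoeffding-plus-union-bound statement (assuming i.i.d. samples and a loss bounded in [0, 1]; the exact constants in the formalization may differ): with probability at least 1 − δ, uniformly over a finite class $\mathcal{H}$,

$$\sup_{h \in \mathcal{H}} \bigl| R(h) - \hat{R}_n(h) \bigr| \;\le\; \sqrt{\frac{\ln\bigl(2\,\lvert\mathcal{H}\rvert/\delta\bigr)}{2n}},$$

where $R(h)$ is the population risk, $\hat{R}_n(h)$ the empirical risk on $n$ samples, and $\lvert\mathcal{H}\rvert$ the class size.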

The main idea is to build a readable and pedagogically structured “theorem ladder” for ML theory rather than just isolated declarations.

I’m trying to keep:

  • explicit assumptions
  • scoped theorem statements
  • zero sorry
  • close alignment with standard SLT presentations
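
As a sketch of what this style can look like at the declaration level (illustrative only; these names are not from the FormalSLT repo):

```lean
import Mathlib

-- Illustrative sketch, not a declaration from FormalSLT: the deviation
-- rate √(log(2·card/δ) / (2n)) appearing in finite-class bounds, with
-- all ℕ → ℝ coercions written out explicitly rather than left implicit.
noncomputable def finiteClassRate (card n : ℕ) (δ : ℝ) : ℝ :=
  Real.sqrt (Real.log (2 * (card : ℝ) / δ) / (2 * (n : ℝ)))

-- A small end-to-end fact about it in the same "zero sorry" spirit.
example (card n : ℕ) (δ : ℝ) : 0 ≤ finiteClassRate card n δ :=
  Real.sqrt_nonneg _
```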

Compared to some existing Lean SLT efforts that focus more heavily on empirical-process infrastructure and abstract probability machinery, this project is currently more focused on explicit finite-sample PAC/Rademacher/stability routes and readable end-to-end theorem chains.

I’d especially appreciate feedback on:

  • theorem organization
  • proof structure
  • naming/API decisions
  • useful next formalization targets

Thank you,
R. S

submitted by /u/trickyrex1
