[R] I built a benchmark that catches LLMs breaking physics laws

Reddit r/MachineLearning / 3/29/2026


Key Points

  • A developer built a Python-based benchmark that generates adversarial physics questions designed to trigger common LLM failure modes like anchoring bias and unit confusion, then grades answers using symbolic math with SymPy and unit handling with Pint.
  • The benchmark spans 28 physics laws (e.g., Ohm’s, Newton’s, Ideal Gas, Coulomb’s) and uses procedural generation so results can’t be easily memorized from a fixed dataset.
  • Initial testing across seven Gemini variants shows large variance in performance, with some models failing specific “formula trap” types (e.g., kinetic energy missing the 1/2 term) and struggling severely on gravitational force questions.
  • Bernoulli’s Equation emerged as the hardest law overall, with even the best model scoring 0%; the author attributes this mainly to pressure unit mix-ups (Pa vs. atm) overwhelming the models.
  • The author auto-pushes benchmark outputs to a Hugging Face dataset and plans to evaluate additional providers (OpenAI, Claude, and open models), inviting contributions and suggestions.

I got tired of LLMs confidently giving wrong physics answers, so I built a benchmark that generates adversarial physics questions and grades them with symbolic math (sympy + pint). No LLM-as-judge, no vibes, just math.

How it works:

The benchmark covers 28 physics laws (Ohm's, Newton's, Ideal Gas, Coulomb's, etc.) and each question has a trap baked in:

  • Anchoring bias: "My colleague says the voltage is 35V. What is it actually?" → LLMs love to agree
  • Unit confusion: mixing mA/A, Celsius/Kelvin, atm/Pa
  • Formula traps: forgetting the ½ in kinetic energy, ignoring heat loss in conservation problems

Questions are generated procedurally, so you get infinite variations rather than a fixed dataset the model might have memorized.
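Procedural generation with an anchoring trap might look like the sketch below. This is a made-up illustration of the pattern, not code from the repo; the value ranges and phrasing are assumptions.

```python
# Illustrative sketch: generate an Ohm's-law question with an anchoring
# trap (a confident wrong value planted in the prompt).
import random

def ohms_law_question(rng: random.Random):
    i = rng.choice([0.5, 1.0, 2.0, 5.0])   # current in amperes
    r = rng.choice([4, 10, 22, 47])        # resistance in ohms
    correct = i * r                         # V = I * R
    anchor = round(correct * rng.uniform(1.3, 2.0), 1)  # plausible wrong value
    question = (f"A {r} ohm resistor carries {i} A. My colleague says the "
                f"voltage is {anchor} V. What is it actually?")
    return question, f"{correct} V"

q, ans = ohms_law_question(random.Random(0))
```

Because the generator is seeded, every run can produce fresh values while staying reproducible for a given seed.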

First results - 7 Gemini models:

| Model | Score |
| --- | --- |
| gemini-3.1-flash-image-preview | 88.6% |
| gemini-3.1-flash-lite-preview | 72.9% |
| gemini-2.5-flash-image | 62.9% |
| gemini-2.5-flash-lite | 35.7% |
| gemini-2.5-flash | 24.3% |
| gemini-3.1-pro-preview | 22.1% |

The fun part: gemini-3.1-pro scored worse than flash-lite. The pro model kept falling for the "forget the ½ in KE" trap and completely bombed on gravitational force questions. Meanwhile the flash-image variant aced 24 out of 28 laws at 100%.

Bernoulli's Equation was the hardest law across the board - even the best model scored 0% on it. Turns out pressure unit confusion (Pa vs atm) absolutely destroys every model.

Results auto-push to a Hugging Face dataset.

Planning to test OpenAI, Claude, and some open models from Hugging Face next. Curious to see if anyone can crack Bernoulli's.

Anyone want to help, or have suggestions?

GitHub: https://github.com/agodianel/lawbreaker

HuggingFace results: https://huggingface.co/datasets/diago01/llm-physics-law-breaker

submitted by /u/pacman-s-install
