Same double-pendulum prompt, same host renderer, and two models picked opposite θ conventions. You can see it within seconds.

Reddit r/LocalLLaMA / 5/16/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Ideas & Deep Analysis

Key Points

  • The author ran the same double-pendulum simulation contract on Claude 3.5 Sonnet and DeepSeek V3 via OpenRouter and found that the two models produced opposite θ (angle) conventions in the rendered output.
  • Because the host renderer reads θ1 and θ2 strictly from each model’s getInfo() and draws with identical code, any mismatch in the model’s internal convention appears immediately as a mirror-image spatial flip.
  • The simulation contract is designed so models can only implement step(dt), getInfo(), and reset(), while the host owns all rendering pixels, preventing models from masking convention differences with custom drawing logic.
  • Cached execution traces show that both models acknowledged the specified angle convention in their reasoning, yet still generated code that interpreted it differently.
  • The project, Physics Bench, currently focuses on a single double-pendulum case and visually compares model behavior side by side, making subtle physics disagreements—such as gravity torque sign errors—easy to spot.

I ran the same double-pendulum generation contract against Claude 3.5 Sonnet and DeepSeek V3 on OpenRouter, under identical initial conditions (θ1 = π/2, θ2 = π/2, zero angular velocities). The host renderer in public/workers/simulator-host.js reads info.theta1 and info.theta2 from whatever the model's getInfo() returns, then draws both bobs using a fixed pivot at top center and a fixed scale derived from L1+L2. It does not care what convention the model used internally; it just plots the angles it receives.
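For reference, the draw step is conceptually this simple (a paraphrased sketch based on the description above, not the actual source of public/workers/simulator-host.js; the canvas math and where L1/L2 come from are my assumptions):

```js
// Sketch of the host's draw step (paraphrased; exact names in
// public/workers/simulator-host.js may differ). It trusts getInfo() blindly:
// theta is plotted as measured from the downward vertical at a top-center pivot.
function drawPendulum(ctx, info, width, height) {
  const pivot = { x: width / 2, y: height * 0.2 };     // fixed pivot, top center
  const scale = (height * 0.6) / (info.L1 + info.L2);  // fixed scale from L1 + L2
  const x1 = pivot.x + scale * info.L1 * Math.sin(info.theta1);
  const y1 = pivot.y + scale * info.L1 * Math.cos(info.theta1);
  const x2 = x1 + scale * info.L2 * Math.sin(info.theta2);
  const y2 = y1 + scale * info.L2 * Math.cos(info.theta2);
  ctx.beginPath();
  ctx.moveTo(pivot.x, pivot.y);
  ctx.lineTo(x1, y1);
  ctx.lineTo(x2, y2);
  ctx.stroke();
}
```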

Within the first second of simulation, the two panels looked like mirror images. Claude's pendulum hung downward and swung as expected from a horizontal release. DeepSeek's pendulum pointed upward from the pivot, as if the initial condition meant "π/2 from the downward vertical" rather than "π/2 from the upward vertical." Both panels rendered through the exact same drawing code. The only thing that differed was the output of step() and getInfo().

The reason this surfaces so cleanly is the contract design. Models implement step(dt), getInfo(), and reset() only. They never write a draw function. The host owns every pixel. So there is no way for a model to mask a convention choice behind its own rendering logic. If model A measures θ from the positive y axis (up) and model B measures from the negative y axis (down), the host draws them both the same way, and the mismatch is immediately visible as a spatial flip.
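To make that concrete (my notation, assuming the host plots x = L·sin θ, y = L·cos θ with canvas y growing downward): the same physical configuration measured from the up axis comes out as π − θ, and since sin(π − θ) = sin θ while cos(π − θ) = −cos θ, the bob lands at the same horizontal offset but mirrored vertically:

```js
// Same physical configuration, reported under two conventions (assumed host
// mapping: x = sin(theta), y = cos(theta), canvas y grows downward).
const theta = Math.PI / 3;            // angle measured from the DOWN axis
const reported = Math.PI - theta;     // same configuration measured from UP
console.log(Math.sin(theta), Math.cos(theta));       // 0.866,  0.5 -> below the pivot
console.log(Math.sin(reported), Math.cos(reported)); // 0.866, -0.5 -> above the pivot
```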

The generation contract lives in lib/prompt.ts. The model receives a system message specifying the equations of motion and the initial conditions, then must return exactly one fenced code block where the first line is function createSimulator(. No imports, no exports, no DOM access, no draw. The prompt does specify the angle convention, but the two models interpreted the same sentence differently. I checked the cached transcripts in generated-simulators/<slug>.trace.json and both models acknowledged the convention in their chain of thought before writing code that disagreed with each other.
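For anyone who has not opened the repo, a conforming response body has roughly this shape (a stub I wrote to illustrate the contract; real submissions fill step(dt) with the actual dynamics, and whether lengths are exposed via getInfo() is my guess):

```js
function createSimulator() {
  // Angles per the prompt's vertical convention, angular velocities zero.
  let theta1 = Math.PI / 2, theta2 = Math.PI / 2;
  let omega1 = 0, omega2 = 0;
  const L1 = 1.0, L2 = 1.0;
  return {
    step(dt) {
      // Real submissions integrate the double-pendulum equations of motion
      // here (e.g. RK4 over the coupled ODEs); stubbed out for illustration.
    },
    getInfo() {
      return { theta1, theta2, L1, L2 };
    },
    reset() {
      theta1 = Math.PI / 2; theta2 = Math.PI / 2;
      omega1 = 0; omega2 = 0;
    },
  };
}
```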

This is from a small project called Physics Bench, built with Verdent. It currently covers one problem (double pendulum) and has no scoring pipeline. It just runs the models side by side and lets you watch. The interesting part is how many subtle disagreements become obvious when you strip away the model's ability to control rendering. Convention mismatch is the most visually dramatic, but I have also seen models diverge on the sign of the gravitational torque term, which produces a slower drift rather than an instant flip.
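The drift-vs-flip distinction makes sense in the simple-pendulum limit (my illustration, not repo code): a convention mismatch is a pure geometric remap, visible in frame one, while a torque sign error is a dynamics bug that has to accumulate through integration:

```js
// Simple-pendulum limit, explicit Euler for brevity. With theta measured from
// the downward vertical, the restoring term is alpha = -(g/L) * sin(theta);
// flipping the sign makes the hanging equilibrium unstable.
function simpleStep([theta, omega], dt, torqueSign) {
  const g = 9.81, L = 1.0;
  const alpha = torqueSign * (g / L) * Math.sin(theta); // torqueSign = -1 is correct
  return [theta + omega * dt, omega + alpha * dt];
}

let good = [0.3, 0], bad = [0.3, 0];
for (let i = 0; i < 100; i++) {
  good = simpleStep(good, 0.01, -1); // oscillates around theta = 0
  bad = simpleStep(bad, 0.01, +1);   // runs away from theta = 0 over time
}
console.log(good[0].toFixed(2), bad[0].toFixed(2)); // trajectories separate gradually
```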

For anyone who wants to try swapping in other models: the contract is strict enough that most models on OpenRouter produce a valid simulator on the first attempt. When one fails (NaN propagation, truncation at SIMULATOR_MAX_TOKENS = 16000), a correction loop feeds the error back into the same conversation as a user message so the model can patch its own code without losing context.
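The loop is roughly this shape (a hypothetical sketch; callModel and checkSimulator are names I made up, not the project's):

```js
// Hypothetical sketch of the correction loop. The detail that matters: the
// error goes back as a *user* message in the same conversation, so the model
// keeps its own earlier reasoning and code in context.
async function generateWithRetries(messages, maxAttempts = 3) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const code = await callModel(messages); // assumed OpenRouter chat wrapper
    const error = checkSimulator(code);     // NaN propagation, truncation, contract violations
    if (!error) return code;
    messages.push({ role: "assistant", content: code });
    messages.push({
      role: "user",
      content: `Your simulator failed: ${error}. Return a corrected version.`,
    });
  }
  throw new Error("no valid simulator after retries");
}
```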

Curious whether anyone else has run into convention ambiguity when prompting models to implement physics from equations of motion, and whether you found a prompt phrasing that reliably disambiguates it.

submitted by /u/Independent_Plum_489