I ran the same double pendulum generation contract against Claude 3.5 Sonnet and DeepSeek V3 on OpenRouter, both under identical initial conditions (θ1 = π/2, θ2 = π/2, both angular velocities zero). The host renderer in public/workers/simulator-host.js reads info.theta1 and info.theta2 from whatever the model's getInfo() returns, then draws both bobs using a fixed pivot at top center and a fixed scale derived from L1+L2. It does not care what convention the model used internally. It just plots the angle it receives.
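For readers who haven't seen the host: the mapping is just polar-to-screen, roughly like the sketch below. This is my own reconstruction, not the actual simulator-host.js; the function name, the pivot margin, and the assumption that getInfo() also exposes L1 and L2 are mine. It assumes θ is measured from the downward vertical, with screen y growing downward.

```javascript
// Hypothetical sketch of the host's angle-to-pixel mapping (names and
// margins are mine, not the project's actual code). Assumes theta measured
// from the downward vertical; screen y increases downward.
function toBobPositions(info, canvasWidth, canvasHeight) {
  const pivot = { x: canvasWidth / 2, y: 40 };              // fixed pivot, top center
  const scale = (canvasHeight - 80) / (info.L1 + info.L2);  // fixed scale from L1 + L2
  const bob1 = {
    x: pivot.x + scale * info.L1 * Math.sin(info.theta1),
    y: pivot.y + scale * info.L1 * Math.cos(info.theta1),
  };
  const bob2 = {
    x: bob1.x + scale * info.L2 * Math.sin(info.theta2),
    y: bob1.y + scale * info.L2 * Math.cos(info.theta2),
  };
  return { pivot, bob1, bob2 };
}
```

The point is that the mapping has exactly one convention baked in, so any model that internally uses another one renders wrong immediately.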
Within the first second of simulation, the two panels looked like mirror images. Claude's pendulum hung downward and swung as expected from a horizontal release. DeepSeek's pendulum pointed upward from the pivot, as if the initial condition meant "π/2 from the downward vertical" rather than "π/2 from the upward vertical." Both panels rendered through the exact same drawing code. The only thing that differed was the output of step() and getInfo().
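The vertical flip falls straight out of the trigonometry. A minimal sketch (mine, not project code): if the host plots (sin θ, cos θ) expecting θ from the downward vertical, a model that reports the same physical configuration as an angle from the upward vertical hands back π − θ, which has the same sine but a negated cosine — same horizontal position, bob mirrored above the pivot.

```javascript
// Minimal illustration (my own sketch) of why an unconverted angle
// convention renders as a vertical flip. The host plots (sin θ, cos θ)
// in screen coordinates, assuming θ is measured from the downward vertical.
function plot(theta, L) {
  return { x: L * Math.sin(theta), y: L * Math.cos(theta) };
}

const thetaFromDown = 0.3;                    // physical state, down-convention
const thetaFromUp = Math.PI - thetaFromDown;  // same state, opposite convention

const hanging = plot(thetaFromDown, 1);  // y > 0: drawn below the pivot
const inverted = plot(thetaFromUp, 1);   // y < 0: drawn above the pivot
```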
The reason this surfaces so cleanly is the contract design. Models implement step(dt), getInfo(), and reset() only. They never write a draw function. The host owns every pixel. So there is no way for a model to mask a convention choice behind its own rendering logic. If model A measures θ from the positive y axis (up) and model B measures from the negative y axis (down), the host draws them both the same way, and the mismatch is immediately visible as a spatial flip.
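Concretely, a conforming simulator has this shape. To keep the sketch short I've reduced it to a single pendulum; the real task is the double pendulum, and any field names beyond theta1 are my assumption about what getInfo() returns.

```javascript
// Sketch of the contract shape (simplified to a single pendulum for
// brevity; field names beyond theta1 are my assumption). No imports, no
// exports, no DOM, no draw — the host owns rendering entirely.
function createSimulator() {
  const g = 9.81, L = 1.0;
  let theta, omega;
  function reset() {
    theta = Math.PI / 2; // horizontal release
    omega = 0;
  }
  function step(dt) {
    // semi-implicit Euler: update velocity first, then position
    omega += -(g / L) * Math.sin(theta) * dt;
    theta += omega * dt;
  }
  function getInfo() {
    return { theta1: theta, omega1: omega, L1: L };
  }
  reset();
  return { step, getInfo, reset };
}
```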
The generation contract lives in lib/prompt.ts. The model receives a system message specifying the equations of motion and the initial conditions, then must return exactly one fenced code block whose first line is function createSimulator(. No imports, no exports, no DOM access, no draw. The prompt does specify the angle convention, but the two models interpreted the same sentence differently. I checked the cached transcripts in generated-simulators/<slug>.trace.json: both models acknowledged the convention in their chain of thought, then wrote code that disagreed with each other anyway.
This is from a small project called Physics Bench, built with Verdent. It currently covers one problem (double pendulum) and has no scoring pipeline. It just runs the models side by side and lets you watch. The interesting part is how many subtle disagreements become obvious when you strip away the model's ability to control rendering. Convention mismatch is the most visually dramatic, but I have also seen models diverge on the sign of the gravitational torque term, which produces a slower drift rather than an instant flip.
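A toy illustration of that slower failure mode (my sketch, single pendulum, not project code): two integrators differing only in the sign of the gravitational term start from the identical state, so the panels agree at first, then gradually separate as the wrong-sign version drifts toward the inverted position instead of flipping instantly.

```javascript
// Toy single-pendulum demo (my own sketch) of a sign error on the
// gravitational torque. sign = -1 is correct for theta measured from the
// downward vertical; sign = +1 makes the inverted position the attractor.
function simulate(sign, steps, dt = 0.001) {
  const g = 9.81, L = 1.0;
  let theta = Math.PI / 2, omega = 0; // horizontal release, at rest
  for (let i = 0; i < steps; i++) {
    omega += sign * (g / L) * Math.sin(theta) * dt;
    theta += omega * dt;
  }
  return theta;
}

const right = simulate(-1, 2000); // swings back through the vertical
const wrong = simulate(+1, 2000); // drifts up toward the inverted region
```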
For anyone who wants to try swapping in other models: the contract is strict enough that most models on OpenRouter can produce a valid simulator on the first attempt, and when they fail (NaN propagation, truncation at SIMULATOR_MAX_TOKENS = 16000), there is a correction loop that feeds the error back into the same conversation as a user message so the model can patch its own code without losing context.
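The loop described above can be sketched roughly like this. This is a mock of the idea, not the project's actual implementation: callModel and validate are placeholders you would wire to an OpenRouter request and to a sandboxed eval-and-step check, and the retry cap is my own choice.

```javascript
// Sketch of a correction loop of the kind described (mock, not the
// project's code): the validation error goes back into the same message
// list as a user turn, so the retry keeps full conversational context.
async function generateWithRepair(messages, callModel, validate, maxAttempts = 3) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const code = await callModel(messages); // body of the returned fenced block
    messages.push({ role: "assistant", content: code });
    const error = validate(code);           // null means the simulator ran cleanly
    if (!error) return code;
    messages.push({
      role: "user",
      content: `Your simulator failed: ${error}. Please return a corrected version.`,
    });
  }
  throw new Error("model failed to produce a valid simulator");
}
```

Keeping the failed attempt and the error in the same transcript is what lets the model patch its own code rather than regenerate from scratch.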
Curious whether anyone else has run into convention ambiguity when prompting models to implement physics from equations of motion, and whether you found a prompt phrasing that reliably disambiguates it.