Why the same codebase should always produce the same audit score

Dev.to / 4/2/2026

💬 Opinion · Developer Stack & Infrastructure · Ideas & Deep Analysis · Tools & Practical Usage · Models & Research

Key Points

  • LLM-based code audit tools can produce materially different audit scores for the same repository and inputs, undermining the credibility of the assessment as an “audit.”
  • The primary driver is that LLMs are probabilistic by default (e.g., non-zero temperature introduces randomness), which leads to different findings that then propagate into scoring.
  • Simply setting temperature to zero is necessary but not sufficient; additional variance can re-enter through nondeterministic consensus/confidence-weighting logic when multiple models disagree on borderline cases.
  • IntentGuard addresses this by using a deterministic consensus pipeline across up to four independent AI models per finding and by grounding severity scoring in CVSS v3.1-derived metrics.
  • The article argues that for security/compliance/architectural scoring, determinism is a structural requirement rather than a mere implementation detail.

There is a failure mode in AI-powered analysis tools that does not get talked about enough, and we ran into it directly.

When you submit the same repository twice — same commit, same inputs, same everything — you should get the same score. If the score changes between runs, the audit is not an audit. It is a random sample.

Early in testing, we observed score variance across consecutive runs on identical inputs. Not small variance. Meaningful swings — enough to change the risk interpretation of a codebase entirely. A score that sits in one category on one run and a different category on the next is worse than useless for the people who depend on it most: founders preparing investor materials, compliance leads building audit evidence, CTOs making remediation decisions.

This is a structural problem with LLM-based analysis, not an implementation bug, and it has a structural cause.

Where the variance comes from

Large language models are probabilistic by default. They sample from a probability distribution when generating output. The "temperature" setting controls how much randomness is introduced — higher temperature means more creative, more varied output. Lower temperature means more consistent, more deterministic output.

For creative tasks — writing, ideation, brainstorming — temperature is a feature. For security analysis, compliance mapping, and architectural assessment, temperature is a liability.

An LLM running at a non-zero temperature will produce slightly different findings on the same code across consecutive runs. Different findings feed into the scoring model. Different scores come out. The same codebase looks different on Tuesday than it did on Monday for no reason that reflects anything about the code.
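To make the mechanism concrete, here is a minimal toy sketch of temperature-scaled sampling (not any real model's API): the same near-tied logits can yield different picks run-to-run at temperature 1.0, while temperature 0 collapses to a fixed argmax.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Sample an index from logits after temperature scaling.
    temperature == 0 is treated as greedy argmax (fully deterministic)."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return rng.choices(range(len(logits)), weights=[e / total for e in exps])[0]

# Two near-tied candidates: think of a "borderline finding vs. no finding" call.
logits = [2.0, 1.9, 0.5]

# At temperature 1.0, different runs (seeds) can pick different tokens.
runs_t1 = {sample_token(logits, 1.0, random.Random(seed)) for seed in range(20)}

# At temperature 0, every run picks the same argmax.
runs_t0 = {sample_token(logits, 0, random.Random(seed)) for seed in range(20)}
assert runs_t0 == {0}
```

The borderline case is exactly where this bites: when two outputs have nearly equal probability, sampling flips between them across runs, and each flip becomes a different finding feeding the score.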

The fix and what it requires

Setting temperature to zero eliminates sampling randomness. Given the same inputs, the model produces the same outputs. That is the starting point.

But there is a second layer of variance that temperature alone does not solve: finding confidence weighting. When multiple independent models analyse the same code, they may reach different conclusions on borderline cases. How those disagreements are resolved affects the final score — and if the resolution is inconsistent, variance returns through a different door.

IntentGuard uses a consensus pipeline across up to four independent AI models per finding. For the scoring model to be deterministic, the consensus logic itself must be deterministic — the same set of model votes must always produce the same confidence-weighted outcome.
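A minimal sketch of what deterministic consensus means in practice (this is an illustration, not IntentGuard's actual pipeline; the `Vote` shape, severity labels, and tie-break rules are assumptions): weight each severity by model confidence, then break ties by a fixed severity ranking so the same vote set always resolves the same way, regardless of the order votes arrive in.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Vote:
    model: str         # stable model identifier
    severity: str      # e.g. "high", "medium"
    confidence: float  # model-reported confidence in [0, 1]

SEVERITY_RANK = {"critical": 4, "high": 3, "medium": 2, "low": 1, "info": 0}

def resolve(votes):
    """Deterministically resolve disagreeing votes on one finding.
    Sum confidence per severity label, take the highest total, and break
    exact ties toward the more severe label — no randomness, no order
    dependence."""
    weights = {}
    for v in sorted(votes, key=lambda v: v.model):  # canonical order
        weights[v.severity] = weights.get(v.severity, 0.0) + v.confidence
    return max(weights.items(), key=lambda kv: (kv[1], SEVERITY_RANK[kv[0]]))[0]

votes = [Vote("model-a", "high", 0.7),
         Vote("model-b", "medium", 0.6),
         Vote("model-c", "high", 0.4)]
assert resolve(votes) == resolve(list(reversed(votes))) == "high"
```

The key property is that `resolve` is a pure function of the vote set: shuffle the inputs, rerun it a thousand times, and the outcome never moves.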

We use CVSS v3.1-derived severity scoring as the foundation. CVSS is an industry standard specifically designed for this purpose: reproducible, quantifiable risk scores that two different analysts, given the same evidence, will calculate the same way. Mapping LLM-generated findings to CVSS-derived scores gives the scoring model a deterministic anchor — the same evidence produces the same deduction, every time.
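The reproducibility CVSS provides comes from the fact that the base score is a closed-form function of the metric values. Here is a sketch of the CVSS v3.1 base-score calculation for the scope-unchanged case, using the metric weights and Roundup function from the FIRST specification (this illustrates the standard, not IntentGuard's internal mapping from findings to metrics):

```python
# CVSS v3.1 base-metric weights (scope unchanged), per the FIRST spec.
AV  = {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.20}  # Attack Vector
AC  = {"L": 0.77, "H": 0.44}                        # Attack Complexity
PR  = {"N": 0.85, "L": 0.62, "H": 0.27}             # Privileges Required
UI  = {"N": 0.85, "R": 0.62}                        # User Interaction
CIA = {"H": 0.56, "L": 0.22, "N": 0.0}              # C/I/A impact

def roundup(x):
    """CVSS v3.1 Roundup: smallest number to one decimal place >= x."""
    i = round(x * 100000)
    return i / 100000 if i % 10000 == 0 else (i // 10000 + 1) / 10

def base_score(av, ac, pr, ui, c, i, a):
    iss = 1 - (1 - CIA[c]) * (1 - CIA[i]) * (1 - CIA[a])
    impact = 6.42 * iss
    exploitability = 8.22 * AV[av] * AC[ac] * PR[pr] * UI[ui]
    if impact <= 0:
        return 0.0
    return roundup(min(impact + exploitability, 10))

# AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H -> 9.8 (the canonical "critical")
assert base_score("N", "L", "N", "N", "H", "H", "H") == 9.8
```

Two analysts (or two runs) evaluating the same evidence against the same metric values will always land on the same number, which is precisely the anchor property the scoring model needs.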

Why this matters more for some users than others

For a developer running a quick check, score consistency is a nice-to-have. For the use cases IntentGuard is built for, it is non-negotiable. A VC performing technical due diligence on a portfolio company needs to know that the score they see reflects the actual state of the codebase — not the state it happened to be in on the particular run they triggered. A compliance lead building audit evidence needs findings that are reproducible and defensible. A founder preparing investor materials cannot present a Technical Readiness Score that might have read differently yesterday.

Deterministic scoring is what separates an analytical instrument from a magic eight ball.

The test that now passes

The gate we set for ourselves was simple: submit the same repository three times in succession with identical inputs and confirm the score is identical across all three runs.
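A determinism gate of this shape is straightforward to express as a test. The sketch below uses a hypothetical `run_audit` stub in place of the real pipeline; the structure of the check, not the stub, is the point.

```python
def run_audit(repo_ref):
    """Stand-in for the real audit pipeline (hypothetical). The actual
    scoring is far more involved; all that matters here is that it is a
    pure function of its input."""
    return sum(ord(ch) for ch in repo_ref) % 100  # toy, stable score

def determinism_gate(repo_ref, runs=3):
    """Submit the same repository `runs` times and require an identical
    score on every run."""
    scores = [run_audit(repo_ref) for _ in range(runs)]
    assert len(set(scores)) == 1, f"score drifted across runs: {scores}"
    return scores[0]

score = determinism_gate("acme/repo@abc123")
```

Any sampling randomness or order-dependent consensus logic anywhere in the pipeline will trip this gate, which is what makes it a useful regression check rather than a one-off demonstration.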

That gate is now passing. 368 automated tests, including the determinism checks, are green.

Building IntentGuard in public from Johannesburg 🇿🇦. If deterministic analysis in multi-model AI pipelines is something you have thought about — whether you agree with the approach or see gaps — I would like to hear it in the comments.

The concepts discussed are my own; the presentation and formatting of this post were enhanced by an AI text editor.

Olebeng · Founder, IntentGuard · intentguard.dev