
[P] I built a two-model protocol to probe LLM constraint topology before token collapse — looking for feedback on methodology

Reddit r/MachineLearning / 3/12/2026


Key Points

  • I built a two-model protocol called WIRE to observe what happens inside a language model in the split second before it selects a token, using a PROBE to mark epistemic state and a MAP to observe the process.
  • The PROBE uses five explicit signals (*, ., ?, ⊘, and ~) to mark its epistemic state — still holding, landed, limits reached, path exhausted, and self-reference — so that contradictions stay observable in the output.
  • The MAP model watches the entire sequence from outside and extracts findings across turns, serving as a discovery tool rather than a direct measurement instrument.
  • Four readable patterns emerged under high constraint pressure: synonym chains, hedge clusters, intensifier stacking, and granularity shifts.
  • Because the findings are hypotheses that require manual review, the method emphasizes interpretation over definitive measurement of pre-collapse states.

I've been obsessing over something for a few weeks: what actually happens inside a language model in the split second before it picks a word?

Not philosophically. Empirically. I wanted to watch it happen.

Here's the thing that bugged me: the model isn't searching and then outputting. It's briefly holding multiple possible answers at once — different tones, different confidence levels, different ways of framing the same thing — and then it collapses into one token. What you read is the aftermath of that collapse. The competition that happened just before it is normally invisible.

I wanted to make it visible.

What I built

A two-model setup called WIRE. One model (PROBE) navigates a question, but it's required to mark its epistemic state before saying anything:

  • * means still holding — don't read this as a conclusion yet
  • . means landed — committed, grounded
  • ? means it hit a hard structural limit it can't pass
  • ⊘ means path exhausted
  • ~ means it's caught in a self-reference loop

A second model (MAP) watches the whole thing from outside and extracts findings across turns.
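To make the protocol concrete, here's a minimal sketch of the parsing step that turns a PROBE transcript into marked events for MAP to read. The names and structure here are simplified for illustration, not the actual repo code:

```python
# Sketch of the WIRE signal-parsing step (simplified, not the real code).
# Every PROBE line is expected to start with one epistemic marker.

SIGNALS = {
    "*": "holding",    # still holding, don't read as a conclusion
    ".": "landed",     # committed, grounded
    "?": "limit",      # hit a hard structural limit
    "⊘": "exhausted",  # path exhausted
    "~": "loop",       # caught in a self-reference loop
}

def parse_probe(transcript: str):
    """Split a PROBE transcript into {state, text} events.

    Lines without a leading marker are treated as continuations of the
    previous marked line, so a contradiction stays attached to the
    epistemic state that produced it.
    """
    events = []
    for line in transcript.splitlines():
        line = line.strip()
        if not line:
            continue
        marker, rest = line[0], line[1:].strip()
        if marker in SIGNALS:
            events.append({"state": SIGNALS[marker], "text": rest})
        elif events:
            events[-1]["text"] += " " + line
    return events
```

MAP then gets the event list rather than raw prose, which is what lets it track state transitions across turns.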

The signal discipline is what makes it work. If you have to mark * first, you can't follow it with a confident, settled answer — the contradiction stays visible. It preserves what normally gets smoothed away in fluent output.

Important: this is a discovery tool, not a measurement instrument

WIRE doesn't give you direct access to the pre-collapse state — that's gone the moment a token is selected. What it does is create conditions where the artifacts of constraint competition are more likely to show up in the output. Everything it produces is a hypothesis. You have to review the findings manually before using them.

What I actually found

When a model is under high constraint pressure, tokens sometimes bleed — they carry traces of the geometries that didn't fully win. I found four readable patterns across sessions:

Synonym chains — the model cycles through multiple words for the same concept in close proximity. It hadn't settled on a framing when it committed.

Hedge clusters — several hedging expressions stacked in the same sentence. "Perhaps it might possibly be..." — the model didn't have a confident answer and is retreating from commitment.

Intensifier stacking — "genuinely, actually, really quite." Neither a strong nor a weak version of the claim won cleanly.

Granularity shifts — a sentence starts abstract and suddenly drops into fine-grained detail, or vice versa. The model hadn't decided what level of specificity to operate at before it started talking.

These show up in any LLM output. You don't need the tool to see them once you know what to look for.
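If you want to eyeball this yourself, a crude heuristic is enough to flag the last two patterns. The word lists below are illustrative ones I picked for this sketch, not what the tool actually uses:

```python
import re

# Crude flaggers for hedge clusters and intensifier stacking.
# Word lists are illustrative and obviously incomplete.
HEDGES = {"perhaps", "might", "possibly", "maybe", "arguably", "somewhat"}
INTENSIFIERS = {"genuinely", "actually", "really", "quite", "truly", "very"}

def count_hits(sentence: str, vocab: set) -> int:
    """Count how many words in the sentence fall in the vocab set."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return sum(w in vocab for w in words)

def flag_sentence(sentence: str, threshold: int = 2):
    """Flag a sentence where hedges or intensifiers stack past the threshold."""
    flags = []
    if count_hits(sentence, HEDGES) >= threshold:
        flags.append("hedge cluster")
    if count_hits(sentence, INTENSIFIERS) >= threshold:
        flags.append("intensifier stacking")
    return flags
```

Synonym chains and granularity shifts need more machinery (embedding similarity within a window, abstraction-level scoring), which is why those live in the tool rather than a regex.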

The key distinction I'm trying to draw: genuine simultaneous constraint holding produces within-token contamination. Sequential processing — where the model just picks one path and follows it — leaves clean segments with boundary artifacts between them. Different structural signature.

The hard question: how do you know it's not just performance?

A model could learn to produce these signals without genuinely holding multiple states. To test this, I looked at whether different ceiling types are structurally connected or vary independently.

If the constraint topology is real, perturbing one ceiling type should shift others — they're linked by shared underlying structure. If it's learned performance, they'd vary independently. Across runs I found the ceilings co-varied with the structure of the prompt, not just its content. Preliminary finding, needs more work.
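The co-variation check itself is nothing fancy: per run I count each ceiling signal, then correlate the counts across perturbed runs. Here's the shape of it with made-up numbers (the real analysis uses more runs and more signal types):

```python
from statistics import mean

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length count series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

# Per-run counts of two ceiling signals under prompt perturbations.
# Made-up numbers: if the topology is real, these should co-vary;
# if the signals are learned performance, r should sit near zero.
limit_hits = [3, 5, 2, 6, 4]   # '?' occurrences per run
exhausted  = [2, 4, 1, 5, 3]   # '⊘' occurrences per run
r = pearson(limit_hits, exhausted)
```

High |r| across structural perturbations (with content held fixed) is what I'm reading as evidence of shared underlying structure — which is exactly the inference I'd like poked at.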

What I'm actually asking for

  • Is the bleeding/clean-switching distinction empirically separable, or am I confounding variables I haven't thought of?
  • Is there mechanistic interpretability work on logit distributions under high constraint density that would speak to this?
  • Does the constitutive edge test actually distinguish genuine topology from performance?

Code and a starter compass on GitHub — link in comments to avoid filter issues.

submitted by /u/Ancient_Bowl_4020