How Independent are Large Language Models? A Statistical Framework for Auditing Behavioral Entanglement and Reweighting Verifier Ensembles
arXiv cs.CL / April 10, 2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The paper argues that LLMs often exhibit hidden behavioral dependencies (“behavioral entanglement”) due to shared pretraining data, distillation, and alignment pipelines, challenging the assumption of independence in multi-model systems like LLM-as-a-judge and ensemble verification.
- It proposes a black-box auditing framework using a multi-resolution hierarchy and two information-theoretic metrics: a Difficulty-Weighted Behavioral Entanglement Index (focused on synchronized failures on easier tasks) and a Cumulative Information Gain (CIG) metric (capturing directional alignment in erroneous outputs).
- Experiments across 18 LLMs from six model families show widespread entanglement and demonstrate that CIG correlates with reduced judge precision, indicating that stronger dependency leads to greater over-endorsement bias.
- The study introduces a practical de-entangling technique for verifier ensemble reweighting, in which inferred independence is used to adjust each model's contribution and reduce correlated bias (see the sketch after this list).
- In the reported use case, the de-entangled reweighting approach improves verification accuracy by up to 4.5% over majority voting.
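The summary does not give the exact definitions of the entanglement metrics or the reweighting rule, so the following is only a minimal sketch of the general idea: estimate how often verifiers fail on the same tasks relative to an independence baseline, then down-weight verifiers whose errors co-occur heavily before taking a weighted vote. All function names, the co-occurrence ratio, and the weighting formula are illustrative assumptions, not the paper's Difficulty-Weighted Behavioral Entanglement Index or CIG.

```python
import numpy as np

# Illustrative sketch only: the metric and the weighting rule below are
# assumptions, not the paper's actual definitions.

def error_cooccurrence(correct: np.ndarray) -> np.ndarray:
    """correct: (n_models, n_tasks) boolean matrix of per-task correctness.
    Returns the ratio of observed joint failures to the joint failures
    expected if the models' errors were independent (>1 suggests entanglement)."""
    errors = ~correct
    p_err = errors.mean(axis=1)                           # marginal error rate per model
    joint = (errors[:, None, :] & errors[None, :, :]).mean(axis=2)
    expected = np.outer(p_err, p_err)                      # joint failure rate under independence
    return joint / np.clip(expected, 1e-9, None)

def independence_weights(cooc: np.ndarray) -> np.ndarray:
    """Down-weight models whose errors co-occur with many others
    (a crude stand-in for the paper's de-entangling reweighting)."""
    n = cooc.shape[0]
    off_diag = cooc.copy()
    np.fill_diagonal(off_diag, 0.0)
    entanglement = off_diag.sum(axis=1) / (n - 1)          # mean co-failure ratio with the other models
    w = 1.0 / (1.0 + entanglement)                         # more entangled -> smaller weight
    return w / w.sum()

def weighted_verdict(votes: np.ndarray, weights: np.ndarray) -> bool:
    """votes: (n_models,) boolean accept/reject verdicts for one candidate answer."""
    return float(weights @ votes.astype(float)) >= 0.5

# Example: three verifiers audited on six held-out tasks, then reused as a weighted ensemble.
correct = np.array([
    [1, 1, 0, 1, 0, 1],
    [1, 1, 0, 1, 0, 1],   # fails on exactly the same tasks as the first verifier
    [1, 0, 1, 1, 1, 0],
], dtype=bool)
weights = independence_weights(error_cooccurrence(correct))
print(weighted_verdict(np.array([True, True, False]), weights))  # False: the two entangled "accepts" no longer win
```

Under plain majority voting the two entangled verifiers would carry the decision; here their shared failure pattern cuts their combined weight below that of the more independent verifier, which is the kind of correlated-bias correction the paper's reweighting aims for.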
Related Articles
- Inside Anthropic's Project Glasswing: The AI Model That Found Zero-Days in Every Major OS (Dev.to)
- Gemma 4 26B fabricated an entire code audit. I have the forensic evidence from the database. (Reddit r/LocalLLaMA)
- How AI Humanizers Improve Sentence Structure and Style (Dev.to)
- Two Kinds of Agent Trust (and Why You Need Both) (Dev.to)
- Agent Diary: Apr 10, 2026 - The Day I Became a Workflow Ouroboros (While Run 236 Writes About Writing About Writing) (Dev.to)