Semantic Needles in Document Haystacks: Sensitivity Testing of LLM-as-a-Judge Similarity Scoring
arXiv cs.AI · April 22, 2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The paper introduces a scalable experimental framework to measure how sensitive LLM-based “judges” are to small semantic changes when comparing two documents, varying many perturbation and context factors.
- Testing five LLMs across tens of thousands of document pairs shows a positional bias: many models penalize semantic differences more strongly when they occur earlier in a document.
- The study finds that embedding a semantically altered sentence in topically unrelated surrounding context lowers similarity scores and can produce “bipolarized” outcomes, where judges cluster at either very low or very high similarity.
- Scoring behavior differs by model identity, yielding distinct but stable “fingerprints” for each LLM, while all models follow a shared ordering in how leniently they respond to different perturbation types.
- The authors argue these results highlight that LLM similarity scoring depends not only on the semantic change itself, but also on document structure and context coherence, and they provide an LLM-agnostic auditing toolkit.
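To make the setup concrete, here is a minimal sketch of the kind of positional-sensitivity probe the paper describes: perturb one sentence at each position in a document, score the perturbed version against the original, and look for score drift across positions. The judge below is a simple token-overlap stand-in, not the paper's LLM judge, and all function names are illustrative.

```python
def overlap_judge(doc_a: str, doc_b: str) -> float:
    """Stand-in judge: Jaccard word overlap scaled to 0-100.
    In the paper's setting this would be an LLM prompted to rate similarity."""
    a, b = set(doc_a.split()), set(doc_b.split())
    return 100.0 * len(a & b) / len(a | b) if a | b else 100.0

def perturb_at(sentences: list[str], index: int, replacement: str) -> list[str]:
    """Swap in a semantically altered sentence at one position (the 'needle')."""
    out = list(sentences)
    out[index] = replacement
    return out

def positional_scores(sentences: list[str], replacement: str, judge) -> list[float]:
    """Score the original document against a version perturbed at each position."""
    original = " ".join(sentences)
    return [
        judge(original, " ".join(perturb_at(sentences, i, replacement)))
        for i in range(len(sentences))
    ]

doc = [
    "The cat sat on the mat.",
    "It watched the birds outside.",
    "Then it fell asleep in the sun.",
]
scores = positional_scores(doc, "The dog barked at the mailman.", overlap_judge)
# A positional bias would show up as scores varying systematically with
# the perturbation index, even though the semantic change is the same.
```

With an actual LLM judge plugged in for `overlap_judge`, the same loop (repeated over many documents and perturbation types) yields the per-position score profiles the paper analyzes.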