Emergent Inference-Time Semantic Contamination via In-Context Priming

arXiv cs.CL / 4/7/2026

📰 NewsSignals & Early TrendsIdeas & Deep AnalysisModels & Research

Key Points

  • The paper argues that LLMs can exhibit “inference-time semantic contamination” where injecting certain few-shot examples causes measurable distributional shifts in later, semantically unrelated prompts.
  • It revisits prior claims that k-shot prompting alone is insufficient, showing the effect can occur but depends on model capability, with richer models exhibiting stronger drift.
  • In a controlled setup using five culturally loaded numbers as demonstrations, the authors observe shifts toward darker, authoritarian, and stigmatized themes for capable models, while a simpler/smaller model shows no significant effect.
  • The study also finds that even structurally inert demonstrations (nonsense strings) can perturb output distributions, suggesting two mechanisms: structural-format contamination and semantic-content contamination.
  • The authors map boundary conditions for when contamination occurs and highlight direct security implications for LLM applications that rely on few-shot prompting.

Abstract

Recent work has shown that fine-tuning large language models (LLMs) on insecure code or culturally loaded numeric codes can induce emergent misalignment, causing models to produce harmful content in unrelated downstream tasks. The authors of that work concluded that k-shot prompting alone does not induce this effect. We revisit this conclusion and show that inference-time semantic drift is real and measurable; however, it requires models of large-enough capability. Using a controlled experiment in which five culturally loaded numbers are injected as few-shot demonstrations before a semantically unrelated prompt, we find that models with richer cultural-associative representations exhibit significant distributional shifts toward darker, authoritarian, and stigmatized themes, while a simpler/smaller model does not. We additionally find that structurally inert demonstrations (nonsense strings) perturb output distributions, suggesting two separable mechanisms: structural format contamination and semantic content contamination. Our results map the boundary conditions under which inference-time contamination occurs, and carry direct implications for the security of LLM-based applications that use few-shot prompting.

Emergent Inference-Time Semantic Contamination via In-Context Priming | AI Navigate