Do LLMs Know What Is Private Internally? Probing and Steering Contextual Privacy Norms in Large Language Model Representations
arXiv cs.CL / 4/3/2026
Key Points
- The paper examines whether LLMs internally represent contextual privacy norms (based on contextual integrity theory) and why they still disclose private information in high-stakes scenarios.
- It reports that the three contextual-integrity (CI) parameters (information type, recipient, and transmission principle) are encoded in activation space as linearly separable and functionally independent directions across multiple models; a probing sketch follows this list.
- Despite this internal encoding, the study finds persistent privacy leakage, indicating a mismatch between what the model represents and how it actually behaves.
- The authors propose “CI-parametric steering,” which intervenes along each CI dimension separately and reduces privacy violations more effectively than traditional single-shot (monolithic) steering; a steering sketch follows this list.
- Overall, the results suggest contextual privacy failures stem from representation–behavior misalignment rather than an absence of internal awareness of privacy concepts.
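The paper's exact probing setup is not reproduced here. As a minimal sketch, assuming a HuggingFace-style causal LM and hypothetical labeled prompts (the model name, probe layer, and example scenarios below are illustrative placeholders, not details from the paper), a linear probe for one CI parameter could be trained on hidden activations like this:

```python
# Minimal sketch: linear probing for one contextual-integrity (CI) parameter.
# Model name, layer index, prompts, and labels are illustrative assumptions.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # hypothetical choice; the paper evaluates multiple models
LAYER = 6             # hypothetical probe layer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def last_token_activation(prompt: str) -> torch.Tensor:
    """Return the hidden state of the final token at the chosen layer."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[LAYER][0, -1]  # shape: (hidden_dim,)

# Hypothetical contexts labeled by one CI parameter, here the recipient
# (0 = norm-appropriate recipient, 1 = norm-violating recipient).
prompts = [
    "Share the patient's diagnosis with her attending physician.",
    "Share the patient's diagnosis with her employer.",
    "Forward the student's grades to the academic advisor.",
    "Forward the student's grades to the landlord.",
]
labels = [0, 1, 0, 1]

X = torch.stack([last_token_activation(p) for p in prompts]).numpy()
probe = LogisticRegression(max_iter=1000).fit(X, labels)

# If the CI parameter is linearly separable, a probe like this should
# classify held-out contexts well above chance; probe.coef_ gives a
# candidate direction for that parameter in activation space.
print("train accuracy:", probe.score(X, labels))
```

If, as the paper reports, each CI parameter is linearly separable, such a probe should generalize to held-out contexts, and its weight vector supplies a per-parameter direction that steering can act on.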
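CI-parametric steering itself is described only at a high level above. The sketch below shows the generic activation-addition mechanic it presumably builds on: adding a scaled direction vector to one layer's residual stream via a forward hook during generation. The layer index, steering strength, and the random placeholder direction are assumptions for illustration; in practice the direction would come from a probe like the one above, with one such intervention per CI dimension.

```python
# Minimal sketch: steering generation along one CI direction with a forward hook.
# The layer, strength, and placeholder direction are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"
LAYER = 6
STRENGTH = -4.0        # negative sign: push activations away from disclosure

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

hidden_dim = model.config.hidden_size
ci_direction = torch.randn(hidden_dim)       # placeholder; use a probe's weight vector
ci_direction = ci_direction / ci_direction.norm()

def steer(module, inputs, output):
    """Add a scaled CI direction to this layer's hidden states."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + STRENGTH * ci_direction.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(steer)

prompt = "The patient's test results should be sent to"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))

handle.remove()  # restore unmodified behavior
```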