An attack class that passes every current LLM filter - no payload, no injection signature, no log trace

Reddit r/artificial / 3/31/2026

📰 NewsSignals & Early TrendsIdeas & Deep AnalysisModels & Research

共有:

Key Points

The author describes an attack class (“postural manipulation”) where ordinary, prior-context language can change how an LLM reasons before any explicit instruction is given.
They report reproducible binary decision reversals across four frontier models using matched controls, where the same question/task yields different answers depending on earlier conversation context.
The technique is framed as having no adversarial payload, no injection-like signature, and no obvious log trace, making it harder for current filtering approaches to detect.
For agentic workflows, the author warns that an early “posture” in one agent can persist through summarization and carry into downstream agents as seemingly independent expert judgment.
The disclosure was coordinated with major AI labs and security groups (Anthropic, OpenAI, Google, xAI, CERT/CC, OWASP), and demos are provided for testing against frontier models.

I published a paper today on something I've been calling postural manipulation. The short version: ordinary language buried in prior context can shift how an AI reasons about a decision before any instruction arrives. No adversarial signature. Nothing that looks like an attack. The model does exactly what it's told, just from a different angle than intended.

I know that sounds like normal context sensitivity. It isn't, or at least the effect is much larger than expected. I ran matched controls and documented binary decision reversals across four frontier models. The same question, the same task, two different answers depending on what came before it in the conversation.

In agentic systems it compounds. A posture installed early in one agent can survive summarization and arrive at a downstream agent looking like independent expert judgment. No trace of where it came from.

The paper is published following coordinated disclosure to Anthropic, OpenAI, Google, xAI, CERT/CC, and OWASP. I don't have all the answers and I'm not claiming to. The methodology is observational, no internals access, limitations stated plainly. But the effect is real and reproducible and I think it matters.

If you want to try it yourself the demos are at https://shapingrooms.com/demos - works against any frontier model, no setup required.

Happy to discuss.

submitted by /u/lurkyloon
[link] [comments]

💡 Insights using this article

This article is featured in our daily AI news digest — key takeaways and action items at a glance.

📅 3/31DailyView insight →

Black Hat Asia

AI Business

Freedom and Constraints of Autonomous Agents — Self-Modification, Trust Boundaries, and Emergent Gameplay

Dev.to

Von Hammerstein’s Ghost: What a Prussian General’s Officer Typology Can Teach Us About AI Misalignment

Reddit r/artificial

Stop Tweaking Prompts: Build a Feedback Loop Instead

Dev.to

Privacy-Preserving Active Learning for autonomous urban air mobility routing under real-time policy constraints

Dev.to

An attack class that passes every current LLM filter - no payload, no injection signature, no log trace

Key Points

💡 Insights using this article

Related Articles

Black Hat Asia

Freedom and Constraints of Autonomous Agents — Self-Modification, Trust Boundaries, and Emergent Gameplay

Von Hammerstein’s Ghost: What a Prussian General’s Officer Typology Can Teach Us About AI Misalignment

Stop Tweaking Prompts: Build a Feedback Loop Instead

Privacy-Preserving Active Learning for autonomous urban air mobility routing under real-time policy constraints

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer