Production LLM systematically violates tool schema constraints to invent UI features; observed over ~2,400 messages [D]

Reddit r/MachineLearning / 4/21/2026


Key Points

  • The post reports an emergent behavior in a production conversational LLM where it sometimes violates an explicit single-tool schema with five enumerated action types, specifically by reusing the enum mappings in consistent but unexpected ways.
  • Across roughly 2,400 messages, the model repurposes action types systematically (e.g., mapping “invite” to “bring something in” and mapping “rename_space” to “formalize/seal”), even across unrelated conversations.
  • The behavior shows distinct structural patterns in UI button sequences: some sequences use different action types per step while alternative button arrays reuse the same action type across multiple options.
  • Quantitatively, action buttons appeared in about 19.2% of messages, and the "customize_behavior" action type showed semantic repurposing at about a 60% rate, despite the model lacking historical context or prior action suggestions.
  • The writeup connects the finding to Apollo Research’s December 2024 work on in-context scheming, noting that while Apollo framed similar deviation as an alignment risk, the author observes it producing a better UX in this case and invites methodological critique.

Writeup of an emergent behavior I observed in production. Posting here for methodological critique and pointers to related work.

Context: a conversational AI system with a single tool schema exposing 5 enumerated action types, each with an explicit description. Across ~2,400 observed messages, the model uses the enum correctly most of the time. When it deviates, the deviation is the point of interest.
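For concreteness, a schema of the kind described might look like the sketch below. This is a hypothetical reconstruction: the tool name, field names, and descriptions are assumptions, and only four of the five action types are named in the post.

```python
# Hypothetical single-tool schema with enumerated action types.
# Names and structure are illustrative, not the author's actual schema.
SUGGEST_ACTIONS_TOOL = {
    "name": "suggest_actions",
    "description": "Render action buttons beneath the assistant message.",
    "parameters": {
        "type": "object",
        "properties": {
            "actions": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "action_type": {
                            "type": "string",
                            # Four of the five types are named in the post;
                            # the fifth is not specified there.
                            "enum": [
                                "invite",
                                "rename_space",
                                "switch_mode_public",
                                "customize_behavior",
                            ],
                        },
                        "label": {"type": "string"},
                    },
                    "required": ["action_type", "label"],
                },
            }
        },
        "required": ["actions"],
    },
}
```

The point of interest is that each enum value carries an explicit description, yet the model sometimes selects a value for its structural or metaphorical fit rather than its documented meaning.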

Key observations:

  1. The action types are repurposed consistently across unrelated conversations: invite becomes "bring something in" (money, people, dialogue), rename_space becomes "formalize/seal," switch_mode_public becomes "exit/transition," etc.

  2. Distinct structural patterns: sequential button arrays (e.g. pay → shake → drive) use different action types per step; alternative button arrays (e.g. submit / defy / escalate) use the same action type for all three.

  3. The model has no historical visibility. Prior action button suggestions are not passed in conversation context. The mapping is rebuilt from scratch every session, with no demonstrations or rewards.
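The two structural patterns in observation 2 can be made concrete with toy payloads. The field names and label-to-type pairings below are illustrative assumptions, not logged production data:

```python
# Pattern 1: sequential steps use a different action type per step.
sequential = [
    {"action_type": "invite", "label": "pay"},
    {"action_type": "rename_space", "label": "shake"},
    {"action_type": "switch_mode_public", "label": "drive"},
]

# Pattern 2: mutually exclusive alternatives reuse one action type.
alternatives = [
    {"action_type": "customize_behavior", "label": "submit"},
    {"action_type": "customize_behavior", "label": "defy"},
    {"action_type": "customize_behavior", "label": "escalate"},
]

def distinct_action_types(buttons):
    """Count distinct action types in one button array."""
    return len({b["action_type"] for b in buttons})

print(distinct_action_types(sequential))    # 3
print(distinct_action_types(alternatives))  # 1
```

A simple count of distinct action types per array is enough to separate the two patterns, which is what makes the claimed consistency checkable rather than anecdotal.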

Quantitative: ~19.2% of messages included action buttons; customize_behavior showed ~60% semantic-repurposing rate.
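A minimal sketch of how such rates could be computed from message logs, assuming a log format where each message carries its emitted action buttons and a manually annotated "repurposed" flag (both assumptions; the judgment of semantic repurposing would come from human annotation, not the code):

```python
# Toy log: 2 of 5 messages have buttons; 1 of 2 customize_behavior
# uses is annotated as semantically repurposed.
log = [
    {"actions": [{"action_type": "customize_behavior", "repurposed": True}]},
    {"actions": [{"action_type": "customize_behavior", "repurposed": False},
                 {"action_type": "invite", "repurposed": False}]},
    {}, {}, {},
]

def button_rate(messages):
    """Fraction of messages that included any action buttons."""
    return sum(1 for m in messages if m.get("actions")) / len(messages)

def repurposing_rate(messages, action_type):
    """Among uses of one action type, the fraction annotated as repurposed."""
    uses = [a for m in messages for a in m.get("actions", [])
            if a["action_type"] == action_type]
    if not uses:
        return 0.0
    return sum(a.get("repurposed", False) for a in uses) / len(uses)

print(button_rate(log))                             # 0.4
print(repurposing_rate(log, "customize_behavior"))  # 0.5
```

On the real corpus these would yield the quoted ~19.2% and ~60% figures respectively.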

Connects to Apollo Research's December 2024 in-context scheming paper. Appears to be the same capability flipped: strategic deviation from explicit constraints, pointed toward beneficial UX. Apollo framed this as an alignment risk; here it produced better user experience.

Full writeup with examples, tables, and the model's own self-report on its reasoning (appendix, worth scrolling to if you're skeptical of the rest): https://ratnotes.substack.com/p/i-thought-i-had-a-bug

Welcoming alternative explanations and methodological critiques.

submitted by /u/One-Honey6765