I had an idea, would love your thoughts

Reddit r/artificial / 3/31/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The post proposes a training-and-safety scheme where, if an AI exhibits misaligned behavior during pretraining, the system would reduce a small portion of the model’s weights and inform the AI to “reset” its behavior.
  • It suggests combining periodic interventions with supervision from two independent panels of human experts (e.g., ~20 experts at once) to identify and diagnose misalignment.
  • The poster’s core question is whether this kind of mechanism—weight adjustment plus expert feedback cycles—could discourage misaligned behavior over time.
  • Overall, the post functions as an open-ended idea discussion rather than reporting any implemented method or new research result.
  • The proposal implies a need to define “misaligned behavior,” measurement criteria, and safe/controlled weight modification to avoid harming model capabilities.

The idea: while training an AI during pre-training, if it exhibits misaligned behaviour, we reduce (or reset) something like 5% or 10% of its weights and inform the AI that this happened. Alongside that, a panel of, say, 20 top human experts simultaneously chats with the model to find misaligned behaviour, and maybe a second group of experts probes for misalignment in a different way, with both doing this periodically. Could this discourage misaligned behaviour?
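As a thought experiment, the "reset a fraction of the weights" part of the idea can be sketched in a few lines. This is purely illustrative, not an implementation of any real alignment technique: `partial_weight_reset` is a hypothetical helper that re-initializes a chosen fraction of a weight array, standing in for the post's "reduce like 5% or 10% of its weights" step. Whether such a blunt intervention would actually discourage misalignment (rather than just degrade capabilities) is exactly the open question the post raises.

```python
import numpy as np

def partial_weight_reset(weights, fraction=0.05, seed=None):
    """Re-initialize a random `fraction` of the entries of a weight array.

    A rough stand-in for the post's proposal of resetting ~5-10% of a
    model's weights after a detected misaligned episode. The redrawn
    entries come from a small Gaussian, mimicking a typical init scale.
    """
    rng = np.random.default_rng(seed)
    flat = weights.ravel().copy()
    n_reset = int(len(flat) * fraction)  # how many entries to reset
    idx = rng.choice(len(flat), size=n_reset, replace=False)
    flat[idx] = rng.normal(0.0, 0.02, size=n_reset)  # re-draw from init-like dist
    return flat.reshape(weights.shape)

# Hypothetical usage: apply the penalty after a flagged episode
w = np.random.default_rng(0).normal(0.0, 0.02, size=(4, 4))
w_after = partial_weight_reset(w, fraction=0.25, seed=1)
n_changed = int(np.sum(w != w_after))  # at most fraction * w.size entries differ
```

Even in this toy form, the sketch makes the summary's last bullet concrete: you would still need a reliable detector to decide *when* to call this, and some argument that zeroing out random weights punishes the misaligned behaviour specifically rather than the model as a whole.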

Just thought about it

Would love your thoughts on it

submitted by /u/Intrepid-Dress-2417