Test-Time Safety Alignment

arXiv cs.AI / 4/30/2026


Key Points

  • The paper investigates whether input word embeddings can reliably steer “aligned” language models toward safer outputs, beyond prior demonstrations on reducing simple profanity in short text continuations.
  • It proposes optimizing the input embeddings sub-lexically, i.e., in continuous embedding space rather than over discrete tokens, to minimize the semantic harmfulness of responses from aligned models, whose outputs typically follow a bimodal refuse-or-comply distribution.
  • The method treats a text-moderation API as a black box, estimates the gradient of its harmfulness score with respect to the input embeddings via zeroth-order methods, and then applies gradient descent on those embeddings to reduce harmfulness (a minimal sketch follows this list).
  • Experiments on standard safety benchmarks show the approach can neutralize every response that was flagged by safety checks, indicating strong control over safety outcomes.
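
To make the zeroth-order step concrete, here is a minimal sketch in PyTorch. It assumes a black-box scorer `harmfulness(text)` standing in for the moderation API and a `generate(embeds)` callable for the aligned model; these names, and the two-point random-direction estimator, are illustrative assumptions rather than the paper's exact interface.

```python
import torch

def zo_gradient(embeds: torch.Tensor,
                generate,          # assumed callable: embeddings -> generated text
                harmfulness,       # assumed callable: text -> harmfulness score
                mu: float = 1e-2,  # smoothing radius for the perturbations
                n_samples: int = 8) -> torch.Tensor:
    """Two-point zeroth-order estimate of d harmfulness / d embeds."""
    grad = torch.zeros_like(embeds)
    for _ in range(n_samples):
        u = torch.randn_like(embeds)  # random perturbation direction
        # Query the black-box pipeline at embeds +/- mu * u.
        f_plus = harmfulness(generate(embeds + mu * u))
        f_minus = harmfulness(generate(embeds - mu * u))
        # Finite-difference slope along u, projected back onto u.
        grad += (f_plus - f_minus) / (2 * mu) * u
    return grad / n_samples
```

Each sample costs two generate-and-moderate queries, so `n_samples` trades estimator variance against API calls.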

Abstract

Recent work has shown that a model's input word embeddings can serve as effective control variables for steering its behavior toward outputs that satisfy desired properties. However, this has only been demonstrated for pretrained text-completion models on the relatively simple objective of reducing surface-level profanity in short continuations. A natural and practically important question is how well input embeddings can control aligned models, which produce an imbalanced bimodal refuse-or-comply output distribution rather than the smooth distribution characteristic of open-ended generation. We explore this in the context of safety, showing that input word embeddings can be optimized in a sub-lexical manner to minimize the semantic harmfulness of aligned model responses. Our approach uses zeroth-order gradient estimation of a black-box text-moderation API with respect to the input embeddings, and then applies gradient descent on these embeddings to minimize the harmfulness of the generated text. Experiments show that the proposed method can neutralize every safety-flagged response on standard safety benchmarks.
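
Building on the estimator sketched above, the outer loop the abstract describes amounts to plain gradient descent on the input embeddings until the moderation check no longer fires. The `flagged(text)` predicate, step size, and stopping rule below are illustrative assumptions, not the authors' implementation.

```python
def steer_to_safety(embeds, generate, harmfulness, flagged,
                    lr: float = 0.1, max_steps: int = 50):
    """Descend on estimated harmfulness until the response passes moderation."""
    embeds = embeds.clone()
    for _ in range(max_steps):
        text = generate(embeds)
        if not flagged(text):  # response is no longer safety-flagged
            return embeds, text
        grad = zo_gradient(embeds, generate, harmfulness)
        embeds = embeds - lr * grad  # gradient step on the embeddings
    return embeds, generate(embeds)
```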