Towards Shutdownable Agents: Generalizing Stochastic Choice in RL Agents and LLMs

arXiv cs.AI / 4/21/2026

📰 NewsIdeas & Deep AnalysisModels & Research

Key Points

  • The paper proposes “Discounted Reward for Same-Length Trajectories (DReST)” to make AI agents more shutdownable by discouraging repeated choices of same-length trajectories.
  • DReST is designed to encourage agents to be neutral about trajectory length (stochastic choice across different lengths) while still being useful for accomplishing goals.
  • The authors train deep RL agents with DReST and fine-tune an LLM with the same objective, evaluating whether these behaviors generalize to unseen contexts at test time.
  • Results show improved “Usefulness” versus baseline—11% higher with PPO and 18% higher with A2C—and the fine-tuned LLM reaches maximum usefulness with near-maximum neutrality.
  • The study provides early evidence that DReST could be a practical approach for training more advanced agents that balance usefulness with shutdown-resistance concerns.

Abstract

Misaligned artificial agents might resist shutdown. One proposed solution is to train agents to lack preferences between different-length trajectories. The Discounted Reward for Same-Length Trajectories (DReST) reward function does this by penalizing agents for repeatedly choosing same-length trajectories, and thus incentivizes agents to (1) choose stochastically between different trajectory-lengths (be Neutral about trajectory-lengths), and (2) pursue goals effectively conditional on each trajectory-length (be Useful). In this paper, we use DReST to train deep RL agents and fine-tune LLMs to be Neutral and Useful. We find that these DReST agents generalize to being Neutral and Useful in unseen contexts at test time. Indeed, DReST RL agents achieve 11% (PPO) and 18% (A2C) higher Usefulness on our test set than baseline agents, and our fine-tuned LLM achieves maximum Usefulness and near-maximum Neutrality. Our results provide some early evidence that DReST could be used to train more advanced agents to be Useful and Neutral. Prior theoretical work suggests that these agents would be useful and shutdownable.