Towards Shutdownable Agents: Generalizing Stochastic Choice in RL Agents and LLMs

arXiv cs.AI / 4/21/2026

📰 NewsIdeas & Deep AnalysisModels & Research

共有:

Key Points

The paper proposes “Discounted Reward for Same-Length Trajectories (DReST)” to make AI agents more shutdownable by discouraging repeated choices of same-length trajectories.
DReST is designed to encourage agents to be neutral about trajectory length (stochastic choice across different lengths) while still being useful for accomplishing goals.
The authors train deep RL agents with DReST and fine-tune an LLM with the same objective, evaluating whether these behaviors generalize to unseen contexts at test time.
Results show improved “Usefulness” versus baseline—11% higher with PPO and 18% higher with A2C—and the fine-tuned LLM reaches maximum usefulness with near-maximum neutrality.
The study provides early evidence that DReST could be a practical approach for training more advanced agents that balance usefulness with shutdown-resistance concerns.

Abstract

Misaligned artificial agents might resist shutdown. One proposed solution is to train agents to lack preferences between different-length trajectories. The Discounted Reward for Same-Length Trajectories (DReST) reward function does this by penalizing agents for repeatedly choosing same-length trajectories, and thus incentivizes agents to (1) choose stochastically between different trajectory-lengths (be Neutral about trajectory-lengths), and (2) pursue goals effectively conditional on each trajectory-length (be Useful). In this paper, we use DReST to train deep RL agents and fine-tune LLMs to be Neutral and Useful. We find that these DReST agents generalize to being Neutral and Useful in unseen contexts at test time. Indeed, DReST RL agents achieve 11% (PPO) and 18% (A2C) higher Usefulness on our test set than baseline agents, and our fine-tuned LLM achieves maximum Usefulness and near-maximum Neutrality. Our results provide some early evidence that DReST could be used to train more advanced agents to be Useful and Neutral. Prior theoretical work suggests that these agents would be useful and shutdownable.

¿Hasta qué punto podría la IA reemplazarnos en nuestros trabajos? A veces creo que la gente exagera un poco.

Reddit r/artificial

Magnificent irony as Meta staff unhappy about running surveillance software on work PCs

The Register

ETHENEA (ETHENEA Americas LLC) Analyst View: Asset Allocation Resilience in the 2026 Global Macro Cycle

Dev.to

DEEPX and Hyundai Are Building Generative AI Robots

Dev.to

Stop Paying OpenAI to Read Garbage: The Two-Stage Agent Pipeline

Dev.to

Towards Shutdownable Agents: Generalizing Stochastic Choice in RL Agents and LLMs

Key Points

Abstract

Related Articles

¿Hasta qué punto podría la IA reemplazarnos en nuestros trabajos? A veces creo que la gente exagera un poco.

Magnificent irony as Meta staff unhappy about running surveillance software on work PCs

ETHENEA (ETHENEA Americas LLC) Analyst View: Asset Allocation Resilience in the 2026 Global Macro Cycle

DEEPX and Hyundai Are Building Generative AI Robots

Stop Paying OpenAI to Read Garbage: The Two-Stage Agent Pipeline

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer