Model Predictive Control with Differentiable World Models for Offline Reinforcement Learning

arXiv cs.LG / 3/25/2026


Key Points

  • The paper tackles Offline Reinforcement Learning by proposing an inference-time adaptation scheme inspired by Model Predictive Control (MPC), enabling policy improvement without new environment interaction.
  • It introduces a Differentiable World Model (DWM) pipeline that supports end-to-end gradient computation through imagined rollouts, allowing policy parameters to be optimized on the fly during inference.
  • Unlike prior approaches that use learned dynamics mainly for training-time imagination or inference-time candidate sampling, the method explicitly leverages inference-time information to drive gradient-based policy updates.
  • Experiments on D4RL continuous-control benchmarks (MuJoCo locomotion and AntMaze) show consistent performance gains over strong offline RL baselines.
  • Overall, the work suggests a shift from static offline policy execution toward gradient-informed, model-based refinement at inference time using differentiable learned dynamics and rewards.

Abstract

Offline Reinforcement Learning (RL) aims to learn optimal policies from fixed offline datasets, without further interactions with the environment. Such methods train an offline policy (or value function) and apply it at inference time without further refinement. We introduce an inference-time adaptation framework inspired by model predictive control (MPC) that utilizes a pretrained policy along with a learned world model of state transitions and rewards. While existing world-model and diffusion-planning methods use learned dynamics to generate imagined trajectories during training, or to sample candidate plans at inference time, they do not use inference-time information to optimize the policy parameters on the fly. In contrast, our design is a Differentiable World Model (DWM) pipeline that enables end-to-end gradient computation through imagined rollouts for policy optimization at inference time based on MPC. We evaluate our algorithm on D4RL continuous-control benchmarks (MuJoCo locomotion tasks and AntMaze), and show that exploiting inference-time information to optimize the policy parameters yields consistent gains over strong offline RL baselines.
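The core idea of the paper (differentiating imagined rollouts through a learned world model to adapt the policy at inference time) can be illustrated with a toy sketch. Note this is not the authors' implementation: the linear dynamics, quadratic reward, and scalar-gain policy below are stand-ins I invented for the learned world model, reward model, and pretrained policy, and the gradient is propagated through the rollout by a hand-written chain rule in place of autodiff.

```python
# Toy stand-ins (hypothetical, not from the paper): a "learned" linear
# world model s' = A*s + B*u, a "learned" quadratic reward r = -s'^2,
# and a pretrained linear policy u = k*s with a single parameter k.
A, B = 1.05, 0.5

def rollout_with_grad(k, s0, horizon):
    """Roll out `horizon` imagined steps under policy u = k*s.

    Returns (imagined return J, dJ/dk). The sensitivity ds/dk is
    carried through the rollout by the chain rule, mimicking
    end-to-end autodiff through the differentiable world model.
    """
    s, ds_dk = s0, 0.0
    J, dJ_dk = 0.0, 0.0
    for _ in range(horizon):
        u = k * s
        du_dk = s + k * ds_dk          # d(k*s)/dk with s depending on k
        s_next = A * s + B * u         # imagined transition
        ds_dk = A * ds_dk + B * du_dk  # propagate sensitivity forward
        J += -s_next ** 2              # "learned" reward model
        dJ_dk += -2.0 * s_next * ds_dk
        s = s_next
    return J, dJ_dk

# MPC-style inference-time adaptation: a few gradient-ascent steps on
# the imagined return, starting from the pretrained gain k = 0.
k, s0, horizon, lr = 0.0, 1.0, 10, 0.005
J_before, _ = rollout_with_grad(k, s0, horizon)
for _ in range(50):
    _, g = rollout_with_grad(k, s0, horizon)
    k += lr * g
J_after, _ = rollout_with_grad(k, s0, horizon)
print(J_after > J_before)  # imagined return improves after adaptation
```

In a practical pipeline the hand-written sensitivity propagation would be replaced by automatic differentiation through neural dynamics and reward networks, but the control flow is the same: imagine a rollout, score it, and take gradient steps on the policy parameters before acting.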