WOMBET: World Model-based Experience Transfer for Robust and Sample-efficient Reinforcement Learning

arXiv cs.LG · April 13, 2026


Key Points

  • The paper introduces WOMBET, a reinforcement learning framework that performs experience transfer by jointly generating and using prior data, rather than assuming a fixed, pre-collected dataset.
  • WOMBET learns a world model in a source task and generates offline trajectories using uncertainty-penalized planning, then filters for trajectories that have high return and low epistemic uncertainty.
  • It supports a stable handoff to the target task via online fine-tuning with adaptive sampling that balances offline (prior-generated) data and online (target-collected) experience.
  • The authors provide theoretical support by relating the uncertainty-penalized objective to a lower bound on true return and decomposing finite-sample errors into distribution mismatch and approximation error.
  • Experiments on continuous-control benchmarks show improved sample efficiency and stronger final performance versus strong baseline methods, highlighting the value of co-optimizing data generation and transfer.

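The generate-then-filter step in the second key point can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy linear "ensemble" stands in for learned neural world models, and the penalty weight `lam`, the return quantile, and the uncertainty cap are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy ensemble: K linear dynamics models with slightly different weights,
# standing in for an ensemble of learned neural world models.
K, DIM = 5, 3
weights = [np.eye(DIM) + 0.01 * rng.standard_normal((DIM, DIM)) for _ in range(K)]

def ensemble_step(state, action):
    preds = np.stack([W @ state + action for W in weights])  # (K, DIM)
    rewards = np.array([-np.sum(p ** 2) for p in preds])     # toy reward
    # Epistemic uncertainty estimated as ensemble disagreement on the next state.
    u = preds.std(axis=0).mean()
    return preds.mean(axis=0), rewards.mean(), u

def rollout(policy, s0, horizon=20, lam=1.0):
    """Roll out in the learned model, tracking an uncertainty-penalized return."""
    s, ret, pen_ret, max_u = s0, 0.0, 0.0, 0.0
    for _ in range(horizon):
        a = policy(s)
        s, r, u = ensemble_step(s, a)
        ret += r
        pen_ret += r - lam * u   # penalized objective lower-bounds the true return
        max_u = max(max_u, u)
    return ret, pen_ret, max_u

def filter_rollouts(results, ret_quantile=0.5, u_max=1.0):
    """Keep rollouts with high penalized return and low epistemic uncertainty."""
    pen = np.array([r[1] for r in results])
    thresh = np.quantile(pen, ret_quantile)
    return [r for r in results if r[1] >= thresh and r[2] <= u_max]

policy = lambda s: -0.5 * s  # stand-in for the planner's policy
results = [rollout(policy, rng.standard_normal(DIM)) for _ in range(32)]
kept = filter_rollouts(results)
print(f"{len(kept)} of {len(results)} rollouts kept")
```

Because the penalty term is nonnegative, the penalized return never exceeds the raw model return, which is the intuition behind the lower-bound result cited in the fourth key point.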
Abstract

Reinforcement learning (RL) in robotics is often limited by the cost and risk of data collection, motivating experience transfer from a source task to a target task. Offline-to-online RL leverages prior data but typically assumes a given fixed dataset and does not address how to generate reliable data for transfer. We propose *World Model-based Experience Transfer* (WOMBET), a framework that jointly generates and utilizes prior data. WOMBET learns a world model in the source task and generates offline data via uncertainty-penalized planning, followed by filtering trajectories with high return and low epistemic uncertainty. It then performs online fine-tuning in the target task using adaptive sampling between offline and online data, enabling a stable transition from prior-driven initialization to task-specific adaptation. We show that the uncertainty-penalized objective provides a lower bound on the true return and derive a finite-sample error decomposition capturing distribution mismatch and approximation error. Empirically, WOMBET improves sample efficiency and final performance over strong baselines on continuous control benchmarks, demonstrating the benefit of jointly optimizing data generation and transfer.
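The adaptive sampling between offline and online data described in the abstract might look like the sketch below, under a hypothetical schedule in which the offline fraction shrinks as target-task experience accumulates (the paper's actual rule may differ):

```python
import random

def adaptive_batch(offline, online, batch_size=8, floor=0.05):
    """Sample a training batch mixing prior-generated (offline) and
    target-collected (online) transitions.

    Hypothetical schedule: the offline fraction decays toward `floor` as the
    online buffer grows, shifting training from prior-driven initialization
    to task-specific adaptation.
    """
    frac_off = max(floor, len(offline) / (len(offline) + len(online)))
    n_off = round(batch_size * frac_off)
    batch = (random.choices(offline, k=n_off)
             + random.choices(online, k=batch_size - n_off))
    random.shuffle(batch)
    return batch

offline = [("off", i) for i in range(1000)]

# Early in fine-tuning, little online data: batches are offline-heavy.
early = adaptive_batch(offline, [("on", i) for i in range(50)])
# Later, the online buffer dominates and the offline share shrinks.
late = adaptive_batch(offline, [("on", i) for i in range(5000)])
```

The `floor` keeps a small residual share of prior data in every batch, one simple way to guard against forgetting the prior-driven initialization.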