Curiosity-Critic: Cumulative Prediction Error Improvement as a Tractable Intrinsic Reward for World Model Training

arXiv cs.AI · April 22, 2026

💬 Opinion · Models & Research

Key Points

  • The paper argues that common curiosity rewards based only on local prediction error miss how the world model’s cumulative prediction error evolves over all visited transitions.
  • It introduces “Curiosity-Critic,” an intrinsic reward tied to improvement of a cumulative prediction objective that can be computed in a tractable per-step form using the difference from an asymptotic (baseline) error.
  • The method estimates the asymptotic error baseline online with a learned critic co-trained with the world model, allowing exploration to focus on learnable transitions without requiring access to an oracle noise floor.
  • Experiments in a stochastic grid-world environment show Curiosity-Critic converges faster and yields better final world-model accuracy than prediction-error and visitation-count curiosity baselines.
  • The approach provides an online separation of epistemic (reducible) from aleatoric (irreducible) prediction error, and shows that prior curiosity formulations emerge as special cases under different approximations of the baseline.

Abstract

Local prediction-error-based curiosity rewards focus on the current transition without considering the world model's cumulative prediction error across all visited transitions. We introduce Curiosity-Critic, which grounds its intrinsic reward in the improvement of this cumulative objective, and show that it reduces to a tractable per-step form: the difference between the current prediction error and the asymptotic error baseline of the current state transition. We estimate this baseline online with a learned critic co-trained alongside the world model; regressing a single scalar, the critic converges well before the world model saturates, redirecting exploration toward learnable transitions without oracle knowledge of the noise floor. The reward is higher for learnable transitions and collapses toward the baseline for stochastic ones, effectively separating epistemic (reducible) from aleatoric (irreducible) prediction error online. Prior prediction-error curiosity formulations, from Schmidhuber (1991) to learned-feature-space variants, emerge as special cases corresponding to specific approximations of this baseline. Experiments on a stochastic grid world show that Curiosity-Critic outperforms prediction-error and visitation-count baselines in convergence speed and final world model accuracy.
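The per-step reward described above can be illustrated with a minimal toy sketch. This is not the paper's implementation; it assumes a scalar-prediction world model (a running mean per transition) and a critic that is likewise a per-transition running scalar regressed onto the observed prediction error. Two hypothetical transitions are compared: a deterministic "learnable" one and a pure-noise "noisy" one, to show the reward decaying as the model learns versus collapsing toward the aleatoric baseline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative, not the paper's code): the "world model" is a
# running-mean prediction of the next state per transition, and the "critic"
# regresses a single scalar per transition -- its asymptotic prediction error.
n_steps = 2000
lr_model, lr_critic = 0.05, 0.05

pred = {"learnable": 0.0, "noisy": 0.0}       # world-model predictions
baseline = {"learnable": 0.0, "noisy": 0.0}   # critic's asymptotic-error estimates
rewards = {"learnable": [], "noisy": []}

for t in range(n_steps):
    for name in ("learnable", "noisy"):
        # Environment: the learnable transition always lands on 1.0; the noisy
        # one adds irreducible unit-variance Gaussian noise around 1.0.
        target = 1.0 if name == "learnable" else 1.0 + rng.normal(0, 1.0)
        err = (pred[name] - target) ** 2        # current prediction error
        r_int = err - baseline[name]            # intrinsic reward: error minus baseline
        rewards[name].append(r_int)
        # Co-training: the critic chases the observed error (one scalar),
        # the world model moves its prediction toward the observed outcome.
        baseline[name] += lr_critic * (err - baseline[name])
        pred[name] += lr_model * (target - pred[name])

early_learnable = np.mean(rewards["learnable"][:20])    # large: model still improving
late_learnable = np.mean(rewards["learnable"][-100:])   # near zero: model converged
late_noisy = np.mean(rewards["noisy"][-100:])           # near zero: error = baseline
```

In this sketch the learnable transition yields a large early reward that decays as the model converges, while the noisy transition's reward fluctuates around zero once the critic's baseline reaches the noise floor (about 1.0 here), matching the epistemic/aleatoric separation the abstract describes.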