Cross-fitted Proximal Learning for Model-Based Reinforcement Learning

arXiv cs.LG / 4/8/2026


Key Points

  • The paper addresses bias in model-based reinforcement learning (RL) in offline settings with hidden confounding, especially in partially observable environments where latent factors affect actions, rewards, and observations.
  • It builds on a prior reduction that recasts policy evaluation in confounded POMDPs as learning reward-emission and observation-transition “bridge functions” under conditional moment restrictions (CMRs).
  • The authors formulate bridge learning as a CMR estimation problem with nuisance components represented by conditional mean embeddings and conditional densities.
  • They propose a K-fold cross-fitted extension of an existing two-stage bridge estimator to use data more efficiently while preserving the original identification strategy.
  • The work provides theoretical guarantees via an oracle-comparator bound and separates estimation error into a Stage I term from nuisance estimation and a Stage II term from empirical averaging.
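To make the cross-fitting idea in the bullets above concrete, here is a minimal K-fold cross-fitting skeleton: the nuisance (Stage I) is fit on K−1 folds, the empirical moment (Stage II) is averaged on the held-out fold, and the fold-wise estimates are combined. This is an illustrative sketch of generic cross-fitting, not the paper's estimator; the callables `fit_nuisance` and `moment` are hypothetical placeholders.

```python
import numpy as np

def cross_fit(data, K, fit_nuisance, moment, seed=None):
    """Generic K-fold cross-fitted moment average (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))
    folds = np.array_split(idx, K)
    estimates = []
    for k in range(K):
        held_out = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        eta_hat = fit_nuisance(data[train])        # Stage I: nuisance on K-1 folds
        # Stage II: empirical average of the moment on the held-out fold
        estimates.append(moment(data[held_out], eta_hat).mean())
    return float(np.mean(estimates))               # combine across folds
```

Because each fold's moment is evaluated only on data not used to fit its nuisance, every observation contributes to a Stage II average exactly once, which is the efficiency gain over a single sample split.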

Abstract

Model-based reinforcement learning is attractive for sequential decision-making because it explicitly estimates reward and transition models and then supports planning through simulated rollouts. In offline settings with hidden confounding, however, models learned directly from observational data may be biased. This challenge is especially pronounced in partially observable systems, where latent factors may jointly affect actions, rewards, and future observations. Recent work has shown that policy evaluation in such confounded partially observable Markov decision processes (POMDPs) can be reduced to estimating reward-emission and observation-transition bridge functions satisfying conditional moment restrictions (CMRs). In this paper, we study the statistical estimation of these bridge functions. We formulate bridge learning as a CMR problem with nuisance objects given by a conditional mean embedding and a conditional density. We then develop a K-fold cross-fitted extension of the existing two-stage bridge estimator. The proposed procedure preserves the original bridge-based identification strategy while using the available data more efficiently than a single sample split. We also derive an oracle-comparator bound for the cross-fitted estimator and decompose the resulting error into a Stage I term induced by nuisance estimation and a Stage II term induced by empirical averaging.
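The abstract's two-stage CMR estimation can be illustrated in its simplest linear form: for a bridge h(x) = xᵀθ satisfying E[y − h(x) | z] = 0, Stage I estimates conditional means given z (a crude stand-in for the paper's conditional mean embedding), and Stage II solves the projected moment equation. This is a hedged toy sketch, not the paper's kernel-based procedure; all function names here are assumptions.

```python
import numpy as np

def two_stage_cmr(x, y, z):
    """Two-stage estimator for E[y - x @ theta | z] = 0 (linear toy version)."""
    Z = np.column_stack([np.ones(len(z)), z])        # instrument basis in z
    # Stage I: nuisance projections E[x | z] and E[y | z] via least squares
    Px = Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    Py = Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    # Stage II: solve the projected moment equation for theta
    theta, *_ = np.linalg.lstsq(Px, Py, rcond=None)
    return theta
```

In this decomposition, error in `Px`/`Py` is the Stage I (nuisance-estimation) term of the paper's oracle-comparator bound, while the finite-sample least-squares solve contributes the Stage II (empirical-averaging) term.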
