Regularized Centered Emphatic Temporal Difference Learning

arXiv cs.AI / 5/7/2026

Key Points

  • The paper revisits a core tradeoff in off-policy temporal-difference (TD) learning with function approximation: balancing stability, projection geometry, and variance control.
  • While emphatic TD (ETD) improves the off-policy projection geometry through follow-on emphasis, the follow-on trace can have high variance and destabilize learning.
  • The authors show that naively combining Bellman-error centering with an emphatic extension introduces an auxiliary coupling that can break the positive-definiteness of ETD’s key matrix.
  • They propose Regularized Emphatic Temporal-Difference Learning (RETD), which keeps the follow-on trace and regularizes only the auxiliary centering recursion so that the key matrix stays positive definite (a code sketch of this update follows the list).
  • The paper derives the RETD core matrix, proves convergence under a conservative sufficient regularization condition, and demonstrates improved stability and robust behavior on diagnostic linear off-policy prediction tasks.
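
To make the mechanism concrete, here is a minimal linear sketch of one RETD(0)-style step. It assumes the standard emphatic follow-on trace and a scalar centering estimate; the variable names, the emphasis weighting inside the centering recursion, and the exact shrinkage form are illustrative assumptions, not the paper's stated update rule.

```python
import numpy as np

def retd_step(w, u, F, x, x_next, r, rho, gamma, interest, alpha, beta, c):
    """One linear RETD(0)-style step (a sketch under stated assumptions,
    not the paper's exact rule).

    w        -- weight vector of the linear value estimate v(s) ~ w @ x
    u        -- scalar centering estimate of the TD-error drift term
    F        -- follow-on trace carrying the emphatic weighting
    rho      -- importance-sampling ratio pi(a|s) / mu(a|s)
    interest -- interest i_t assigned to the current state
    c        -- regularization strength on the centering recursion
    """
    # Follow-on trace: the standard emphatic accumulation of discounted,
    # importance-corrected interest (ordinary ETD machinery).
    F = gamma * rho * F + interest
    M = F  # for ETD(0) the emphasis equals the follow-on trace

    # Centered TD error: subtract the auxiliary drift estimate u.
    delta = r + gamma * (w @ x_next) - (w @ x) - u

    # Main weight update: emphatically weighted, importance-corrected.
    w = w + alpha * M * rho * delta * x

    # Regularized centering recursion. The "-beta * c * u" shrinkage is one
    # natural reading of lifting the key matrix's lower-right block from
    # 1 to 1 + c; only this auxiliary recursion is regularized (assumption).
    u = u + beta * M * rho * delta - beta * c * u

    return w, u, F
```

Setting c = 0 in this sketch recovers a naive centered emphatic update, the unstable case the paper diagnoses; the intermediate range of c is where the reported robustness lies.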

Abstract

Off-policy temporal-difference (TD) learning with function approximation faces a structural tradeoff among stability, projection geometry, and variance control. Emphatic TD (ETD) improves the off-policy projection geometry through follow-on emphasis, but the follow-on trace can have high variance. We revisit this tradeoff through Bellman-error centering. Although centering naturally removes a common drift term from TD errors, we show that a naive centered emphatic extension introduces an auxiliary coupling that can destroy the positive-definiteness of the ETD key matrix. We propose \emph{Regularized Emphatic Temporal-Difference Learning} (RETD), which preserves the follow-on trace and regularizes only the auxiliary centering recursion, corresponding to lifting the lower-right block of the coupled key matrix from \(1\) to \(1+c\). We derive the RETD core matrix, prove convergence under a conservative sufficient regularization condition, and evaluate the method on diagnostic linear off-policy prediction tasks. The experiments show that RETD avoids the instability of naive centered emphatic learning, preserves favorable emphatic geometry, and exhibits a robust intermediate regime for the regularization parameter \(c\) across the diagnostics.
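
Read against the abstract, the regularization amounts to increasing the lower-right entry of the coupled key matrix that acts on the stacked iterate \((w, u)\). The block symbols below are placeholders, not the paper's notation:

\[
\begin{pmatrix} A_{\mathrm{ETD}} & b \\ a^{\top} & 1 \end{pmatrix}
\;\longrightarrow\;
\begin{pmatrix} A_{\mathrm{ETD}} & b \\ a^{\top} & 1 + c \end{pmatrix},
\]

where the off-diagonal blocks \(b\) and \(a^{\top}\) encode the auxiliary coupling introduced by centering. With \(c = 0\) this coupling can destroy positive-definiteness; a sufficiently large \(c\), per the paper's conservative sufficient condition, restores it and yields convergence.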