Pseudo-Quantized Actor-Critic Algorithm for Robustness to Noisy Temporal Difference Error

arXiv cs.LG / 4/3/2026


Key Points

  • The paper addresses instability in reinforcement learning caused by noisy temporal-difference (TD) errors that arise from bootstrapping, which can destabilize value and policy learning.
  • It revisits the control-as-inference perspective and proposes a robust learning rule using a sigmoid-based distribution model of optimality, where large TD errors likely due to noise lead to gradient vanishing and are implicitly excluded from updates.
  • The method analyzes how forward vs. reverse KL divergences differently affect gradient-vanishing behavior, and uses this insight to design a learning update that remains stable under noisy TD signals.
  • It further decomposes optimality into multiple levels to “pseudo-quantize” TD errors for additional noise reduction, and derives an approximate Jensen–Shannon divergence-based alternative that combines favorable properties.
  • Experiments on RL benchmarks show stable learning even in settings where common heuristics (e.g., target networks, ensembles) are insufficient or rewards are noisy.
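The gradient-vanishing intuition behind the robust learning rule can be illustrated with a toy comparison. This is a sketch only, not the paper's exact loss: the inverse temperature `beta` and the specific functional forms are assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def squared_loss_grad(delta):
    """Gradient of the standard squared TD loss delta**2 w.r.t. delta.
    It grows linearly with |delta|, so a noisy outlier dominates the update."""
    return 2.0 * delta

def sigmoid_optimality_grad(delta, beta=1.0):
    """Gradient magnitude when optimality is modeled as sigma(beta * delta).
    d/d(delta) sigma(beta * delta) = beta * s * (1 - s), which saturates and
    vanishes for large |delta|, implicitly down-weighting likely-noisy TD errors."""
    s = sigmoid(beta * delta)
    return beta * s * (1.0 - s)

for d in [0.1, 1.0, 10.0, 100.0]:
    print(f"delta={d:6.1f}  squared grad={squared_loss_grad(d):7.1f}  "
          f"sigmoid grad={sigmoid_optimality_grad(d):.3e}")
```

Under the squared loss a TD error of 100 produces a gradient 1000x larger than an error of 0.1; under the sigmoid model the same outlier produces an essentially zero gradient, so it is effectively excluded from the update.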

Abstract

In reinforcement learning (RL), temporal difference (TD) errors are widely used to optimize value and policy functions. However, since the TD error is defined by bootstrapping, its computation tends to be noisy, which destabilizes learning. Heuristics to improve the accuracy of TD errors, such as target networks and ensemble models, have therefore been introduced. While these are essential components of current deep RL algorithms, they cause side effects such as increased computational cost and reduced learning efficiency. This paper therefore revisits the TD learning algorithm from the control-as-inference perspective and derives a novel algorithm capable of robust learning against noisy TD errors. First, the distribution model of optimality, a binary random variable, is represented by a sigmoid function. Combined with the forward and reverse Kullback-Leibler divergences, this model yields a robust learning rule: when the sigmoid function saturates on a large TD error that is probably due to noise, the gradient vanishes, implicitly excluding that sample from learning. Furthermore, the two divergences exhibit distinct gradient-vanishing characteristics. Building on these analyses, the optimality is decomposed into multiple levels to achieve pseudo-quantization of TD errors, aiming at further noise reduction. Additionally, a Jensen-Shannon divergence-based approach is approximately derived to inherit the characteristics of both divergences. These benefits are verified on RL benchmarks, demonstrating stable learning even when the heuristics are insufficient or rewards contain noise.
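The multi-level decomposition of optimality ("pseudo-quantization") can be sketched as a smooth staircase built from shifted sigmoids. The thresholds, the slope `beta`, and averaging as the aggregation are illustrative assumptions here, not the paper's exact decomposition:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pseudo_quantize(delta, thresholds=(-1.0, 0.0, 1.0), beta=8.0):
    """Average of sigmoids shifted to several optimality levels.
    The result is a smooth staircase over delta: small fluctuations
    within a plateau barely move the output, which is the intuition
    behind quantizing TD errors for noise reduction."""
    t = np.asarray(thresholds)
    return float(np.mean(sigmoid(beta * (delta - t))))

# Two TD errors 0.1 apart land on the same plateau of the staircase,
# so their mapped values differ by much less than the raw gap.
raw_gap = abs(0.55 - 0.45)
quantized_gap = abs(pseudo_quantize(0.55) - pseudo_quantize(0.45))
print(raw_gap, quantized_gap)
```

The mapping stays monotone in `delta` (so the learning signal keeps its sign) while compressing within-plateau noise, roughly mirroring the noise-reduction goal of the multi-level decomposition.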