K-Score: Kalman Filter as a Principled Alternative to Reward Normalization in Reinforcement Learning

arXiv cs.LG / 4/28/2026


Key Points

  • The paper introduces K-Score, a policy-gradient reinforcement learning method that replaces heuristic reward normalization with an online 1D Kalman filter for estimating reward statistics.
  • By recursively estimating the latent reward mean, K-Score smooths high-variance returns and can adapt to non-stationary environments during training.
  • The approach is designed to add minimal computational overhead and does not require any changes to existing policy network architectures.
  • Experiments on LunarLander and CartPole show that Kalman-filtered rewards accelerate convergence and reduce training variance compared with standard normalization methods.
  • The authors provide implementation code via https://github.com/Sumxiaa/Kalman_Normalization.
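The idea in the points above is to treat the latent reward mean as a slowly drifting scalar state and update it with a 1D Kalman filter as rewards arrive, then center each reward by the filtered estimate. A minimal sketch follows; the paper's actual noise settings and whether it also rescales by variance are not stated in this summary, so the class name and the `q`/`r_var` values here are illustrative assumptions, not the authors' implementation.

```python
class KalmanRewardFilter:
    """Online 1D Kalman filter that tracks a latent reward mean.

    Sketch only: process/measurement noise values are assumptions,
    not taken from the paper.
    """

    def __init__(self, q=1e-4, r_var=1.0):
        self.q = q          # process noise: how fast the latent mean may drift
        self.r_var = r_var  # measurement noise: per-step reward variance
        self.mean = 0.0     # current estimate of the latent reward mean
        self.p = 1.0        # uncertainty of that estimate

    def step(self, reward):
        # Predict: model the latent mean as a random walk, so uncertainty grows.
        self.p += self.q
        # Update: blend the prediction with the observed reward via the Kalman gain.
        k = self.p / (self.p + self.r_var)
        self.mean += k * (reward - self.mean)
        self.p *= 1.0 - k
        # Return the centered reward for use in the policy-gradient update.
        return reward - self.mean
```

Because the filter is recursive, it runs in O(1) time and memory per reward, consistent with the claimed minimal overhead, and a nonzero `q` keeps the gain from vanishing, which is what lets the estimate track non-stationary reward scales.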

Abstract

We propose a simple yet effective alternative to reward normalization in policy gradient reinforcement learning by integrating a 1D Kalman filter for online reward estimation. Instead of relying on fixed heuristics, our method recursively estimates the latent reward mean, smoothing high-variance returns and adapting to non-stationary environments. This approach incurs minimal overhead and requires no modification to existing policy architectures. Experiments on LunarLander and CartPole demonstrate that Kalman-filtered rewards significantly accelerate convergence and reduce training variance compared to standard normalization techniques. Code is available at https://github.com/Sumxiaa/Kalman_Normalization.