Reinforcement Learning Trained Observer Control for Bearings-Only Tracking

arXiv cs.AI / 5/5/2026

Key Points

  • The paper proposes a deep reinforcement learning observer-control policy for autonomous bearings-only tracking of a moving target by formulating the problem as a belief Markov decision process.
  • The belief state is represented using the posterior of a cubature Kalman filter (CKF), linking the learned controller to state estimation uncertainty.
  • A reward function balances two competing goals: minimizing the Euclidean position estimation error and maintaining CKF consistency as measured by Mahalanobis distance. The two objectives are blended by interpolating along the Pareto front with a weighting parameter β (see the sketch after this list).
  • The controller is trained as a deep Q-network (DQN) over 50,000 episodes and evaluated via 5,000 Monte Carlo runs, outperforming two baselines: a perpendicular-to-bearing heuristic and a D-optimal Fisher information maximization criterion.
  • At β = 0.7, the DQN achieves the best accuracy–robustness trade-off, matching the information-theoretic baseline on mean accuracy while cutting worst-case error by nearly an order of magnitude through the reward’s implicit consistency regularization.
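
The summary states that the reward geometrically interpolates between the two objectives but does not give its exact functional form. Below is a minimal Python sketch of one plausible reading, a β-weighted geometric blend of an accuracy sub-reward and a consistency sub-reward; the exponential mappings and the d_scale normalization constant are illustrative assumptions, not the paper's definitions.

```python
import numpy as np

def reward(x_true, x_est, P_est, beta=0.7, d_scale=1.0):
    """Sketch of a two-objective tracking reward (assumed form).

    Blends an accuracy term (Euclidean position error) with a filter-
    consistency term (Mahalanobis distance of the error under the CKF
    posterior covariance P_est), weighted by beta in [0, 1].
    """
    err = x_true[:2] - x_est[:2]                  # position components only
    d_euc = np.linalg.norm(err)                   # accuracy objective
    d_mah = np.sqrt(err @ np.linalg.solve(P_est[:2, :2], err))  # consistency

    # Map each distance to a sub-reward in (0, 1]; larger distance -> smaller reward.
    r_acc = np.exp(-d_euc / d_scale)
    r_con = np.exp(-d_mah)

    # Geometric interpolation along the Pareto front:
    # beta = 1 recovers pure accuracy, beta = 0 pure consistency.
    return r_acc ** beta * r_con ** (1.0 - beta)
```

Under this form, β = 0.7 weights accuracy more heavily while the Mahalanobis factor still penalizes overconfident (inconsistent) filter estimates, which is consistent with the reported reduction in worst-case error.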

Abstract

This paper develops a deep reinforcement learning based observer control policy for autonomous bearings-only tracking of a moving target. The observer manoeuvre problem is formulated as a belief Markov decision process, where the belief state is represented by the posterior of a cubature Kalman filter (CKF). The reward function is designed to address two conflicting objectives: minimising the absolute target position estimation error (Euclidean distance) and maintaining CKF estimation consistency (Mahalanobis distance). The reward is formulated as a geometric interpolation between the two objectives on the Pareto front, parametrised by a weighting factor β ∈ [0, 1]. The policy is implemented as a deep Q-network (DQN) trained over 50,000 episodes. Performance is evaluated over 5,000 Monte Carlo episodes and compared against two baselines: the perpendicular-to-bearing heuristic and the D-optimal Fisher information maximisation criterion. The results show that the DQN policy at β = 0.7 achieves the best trade-off between accuracy and robustness: it matches the information-theoretic baseline on mean tracking accuracy while reducing the worst-case error by nearly a factor of ten, owing to the implicit filter-consistency regularisation provided by the Mahalanobis term in the reward.
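
Since the belief state is the CKF posterior, the filter's measurement update is the core of the belief transition. The following is a minimal, self-contained sketch of a standard CKF update for a single bearing measurement; the state layout [px, py, vx, vy], the bearing-noise variance R, and all names are assumptions for illustration, as the paper's filter configuration is not given here.

```python
import numpy as np

def wrap(a):
    """Wrap an angle to [-pi, pi)."""
    return (a + np.pi) % (2 * np.pi) - np.pi

def ckf_bearing_update(x, P, z, obs_pos, R=np.deg2rad(1.0) ** 2):
    """Standard CKF measurement update for one bearing observation.

    x: target state [px, py, vx, vy]; P: its covariance; z: measured
    bearing from the observer at obs_pos; R: assumed bearing variance.
    """
    n = x.size
    S = np.linalg.cholesky(P)
    # 2n cubature points at +/- sqrt(n) along the columns of S.
    xi = np.sqrt(n) * np.hstack([np.eye(n), -np.eye(n)])
    X = x[:, None] + S @ xi                               # (n, 2n) points

    # Propagate each point through the bearing measurement model.
    Z = np.arctan2(X[1] - obs_pos[1], X[0] - obs_pos[0])  # (2n,) bearings
    z_hat = np.arctan2(np.sin(Z).mean(), np.cos(Z).mean())  # circular mean

    dz = wrap(Z - z_hat)
    Pzz = dz @ dz / (2 * n) + R                # innovation variance (scalar)
    Pxz = (X - x[:, None]) @ dz / (2 * n)      # state-measurement cross-cov
    K = Pxz / Pzz                              # Kalman gain, shape (n,)

    x_new = x + K * wrap(z - z_hat)            # corrected state estimate
    P_new = P - np.outer(K, K) * Pzz           # corrected covariance
    return x_new, P_new
```

The updated posterior (x_new, P_new) is exactly the belief that the DQN policy would condition on when selecting the next observer manoeuvre.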