Reinforcement Learning Trained Observer Control for Bearings-Only Tracking

arXiv cs.AI / 5/5/2026

Key Points

  • The paper proposes a deep reinforcement learning observer-control policy for autonomous bearings-only tracking of a moving target by formulating the problem as a belief Markov decision process.
  • The belief state is represented using the posterior of a cubature Kalman filter (CKF), linking the learned controller to state estimation uncertainty.
  • A reward function balances two competing goals: minimizing the Euclidean position estimation error and maintaining CKF consistency as measured by Mahalanobis distance. The two objectives are blended by interpolating along the Pareto front with a weighting parameter β (see the sketch after this list).
  • The controller is trained as a deep Q-network (DQN) over 50,000 episodes and evaluated via 5,000 Monte Carlo runs, outperforming two baselines: a perpendicular-to-bearing heuristic and a D-optimal Fisher information maximization criterion.
  • At β = 0.7, the DQN achieves the best accuracy–robustness trade-off, matching the information-theoretic baseline on mean accuracy while cutting worst-case error by nearly an order of magnitude through the reward’s implicit consistency regularization.
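
The summary states that the reward geometrically interpolates between the two objectives but does not give its exact functional form. Below is a minimal Python sketch of one plausible reading, a β-weighted geometric blend of an accuracy sub-reward and a consistency sub-reward; the exponential mappings and the d_scale normalization constant are illustrative assumptions, not the paper's definitions.

```python
import numpy as np

def reward(x_true, x_est, P_est, beta=0.7, d_scale=1.0):
    """Sketch of a two-objective tracking reward (assumed form).

    Blends an accuracy term (Euclidean position error) with a filter-
    consistency term (Mahalanobis distance of the error under the CKF
    posterior covariance P_est), weighted by beta in [0, 1].
    """
    err = x_true[:2] - x_est[:2]                  # position components only
    d_euc = np.linalg.norm(err)                   # accuracy objective
    d_mah = np.sqrt(err @ np.linalg.solve(P_est[:2, :2], err))  # consistency

    # Map each distance to a sub-reward in (0, 1]; larger distance -> smaller reward.
    r_acc = np.exp(-d_euc / d_scale)
    r_con = np.exp(-d_mah)

    # Geometric interpolation along the Pareto front:
    # beta = 1 recovers pure accuracy, beta = 0 pure consistency.
    return r_acc ** beta * r_con ** (1.0 - beta)
```

Under this form, β = 0.7 weights accuracy more heavily while the Mahalanobis factor still penalizes overconfident (inconsistent) filter estimates, which is consistent with the reported reduction in worst-case error.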

Abstract

This paper develops a deep reinforcement learning based observer control policy for autonomous bearings-only tracking of a moving target. The observer manoeuvre problem is formulated as a belief Markov decision process, where the belief state is represented by the posterior of a cubature Kalman filter (CKF). The reward function is designed to address two conflicting objectives: minimising the absolute target position estimation error (Euclidean distance) and maintaining CKF estimation consistency (Mahalanobis distance). The reward is formulated as a geometric interpolation between the two objectives on the Pareto front, parametrised by a weighting factor β ∈ [0, 1]. The policy is implemented as a deep Q-network (DQN) trained over 50,000 episodes. Performance is evaluated over 5,000 Monte Carlo episodes and compared against two baselines: the perpendicular-to-bearing heuristic and the D-optimal Fisher information maximisation criterion. The results show that the DQN policy at β = 0.7 achieves the best trade-off between accuracy and robustness: it matches the information-theoretic baseline on mean tracking accuracy while reducing the worst-case error by nearly a factor of ten, owing to the implicit filter-consistency regularisation provided by the Mahalanobis term in the reward.
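
Since the belief state is the CKF posterior, the filter's measurement update is the core of the belief transition. The following is a minimal, self-contained sketch of a standard CKF update for a single bearing measurement; the state layout [px, py, vx, vy], the bearing-noise variance R, and all names are assumptions for illustration, as the paper's filter configuration is not given here.

```python
import numpy as np

def wrap(a):
    """Wrap an angle to [-pi, pi)."""
    return (a + np.pi) % (2 * np.pi) - np.pi

def ckf_bearing_update(x, P, z, obs_pos, R=np.deg2rad(1.0) ** 2):
    """Standard CKF measurement update for one bearing observation.

    x: target state [px, py, vx, vy]; P: its covariance; z: measured
    bearing from the observer at obs_pos; R: assumed bearing variance.
    """
    n = x.size
    S = np.linalg.cholesky(P)
    # 2n cubature points at +/- sqrt(n) along the columns of S.
    xi = np.sqrt(n) * np.hstack([np.eye(n), -np.eye(n)])
    X = x[:, None] + S @ xi                               # (n, 2n) points

    # Propagate each point through the bearing measurement model.
    Z = np.arctan2(X[1] - obs_pos[1], X[0] - obs_pos[0])  # (2n,) bearings
    z_hat = np.arctan2(np.sin(Z).mean(), np.cos(Z).mean())  # circular mean

    dz = wrap(Z - z_hat)
    Pzz = dz @ dz / (2 * n) + R                # innovation variance (scalar)
    Pxz = (X - x[:, None]) @ dz / (2 * n)      # state-measurement cross-cov
    K = Pxz / Pzz                              # Kalman gain, shape (n,)

    x_new = x + K * wrap(z - z_hat)            # corrected state estimate
    P_new = P - np.outer(K, K) * Pzz           # corrected covariance
    return x_new, P_new
```

The updated posterior (x_new, P_new) is exactly the belief that the DQN policy would condition on when selecting the next observer manoeuvre.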