SAVGO: Learning State-Action Value Geometry with Cosine Similarity for Continuous Control

arXiv cs.LG / 5/4/2026


Key Points

  • The paper introduces SAVGO, a reinforcement learning method that uses a geometry-aware objective to shape policy updates in continuous action spaces using value-based similarity.
  • SAVGO learns a joint state-action embedding space where action pairs with similar action-value estimates are mapped to directions with high cosine similarity, while dissimilar pairs are separated in the embedding geometry (see the sketch after this list).
  • Using this learned geometry, the method builds a similarity kernel over candidate actions at each update, steering policy improvement toward higher-value regions beyond what local gradient steps achieve.
  • The approach unifies representation learning, value estimation, and policy optimization under a single geometry-consistent objective, while maintaining the scalability benefits of off-policy actor-critic training.
  • Experiments on MuJoCo continuous-control benchmarks show improved performance over strong baselines, with ablation studies supporting the contributions of value-geometry learning and similarity-driven policy updates.
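
To make the embedding objective concrete, here is a minimal PyTorch sketch of how cosine similarity in a joint state-action embedding could be aligned with value similarity. It is an illustration under stated assumptions, not the paper's code: the encoder `StateActionEmbedding`, the loss `value_geometry_loss`, and the exponential value-similarity target `exp(-|ΔQ|/τ)` are all hypothetical choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StateActionEmbedding(nn.Module):
    """Joint state-action encoder phi(s, a) -> unit-norm embedding.

    Hypothetical architecture; the paper's exact network is not reproduced here.
    """
    def __init__(self, state_dim, action_dim, embed_dim=64, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, embed_dim),
        )

    def forward(self, state, action):
        z = self.net(torch.cat([state, action], dim=-1))
        # Unit-normalize so that a dot product equals cosine similarity.
        return F.normalize(z, dim=-1)

def value_geometry_loss(phi, state, a_i, a_j, q_i, q_j, tau=1.0):
    """Match embedding cosine similarity to a value-similarity target.

    Assumed target: exp(-|Q(s,a_i) - Q(s,a_j)| / tau), so action pairs with
    near-identical value estimates are pulled toward cosine similarity 1,
    while pairs with very different values are pushed toward orthogonality.
    """
    z_i = phi(state, a_i)                         # (B, D), unit norm
    z_j = phi(state, a_j)                         # (B, D), unit norm
    cos = (z_i * z_j).sum(dim=-1)                 # cosine similarity, (B,)
    target = torch.exp(-(q_i - q_j).abs() / tau)  # in (0, 1], (B,)
    return F.mse_loss(cos, target.detach())
```

Detaching the target keeps this loss from back-propagating into the critic; since SAVGO trains representation, value, and policy under one geometry-consistent objective, the actual gradient flow in the paper may differ.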

Abstract

While representation and similarity learning have improved the sample efficiency of Reinforcement Learning (RL), they are rarely used to shape policy updates directly in the action space. To bridge this gap, State-Action Value Geometry Optimization (SAVGO) is proposed: a geometry-aware RL algorithm that explicitly incorporates value-based similarity into the policy update. Specifically, SAVGO learns a joint state-action embedding space in which pairs with similar action-value estimates exhibit high cosine similarity, while dissimilar pairs are mapped to distinct directions. This learned geometry induces a similarity kernel over the candidate actions sampled at each update, allowing policy improvement to be guided directly toward higher-value regions beyond what local gradient-based updates can reach. As a result, representation learning, value estimation, and policy optimization are unified within a single geometry-consistent objective, while preserving the scalability of off-policy actor-critic training. The method is evaluated on standard MuJoCo continuous-control benchmarks, demonstrating improvements over strong baselines on challenging high-dimensional tasks. Ablation studies are conducted to isolate the contributions of value-geometry learning and similarity-based policy updates.
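
The similarity-kernel policy step described above might be sketched as follows, continuing the PyTorch example. Everything here is an assumption for illustration: `similarity_guided_policy_loss`, the Gaussian candidate sampling, the `[-1, 1]` action clamp, and the softmax weighting that mixes critic scores with embedding cosine similarity are plausible stand-ins rather than the authors' exact update.

```python
import torch
import torch.nn.functional as F

def similarity_guided_policy_loss(policy, critic, phi, state,
                                  n_candidates=16, noise_std=0.2, beta=5.0):
    """Kernel-weighted policy improvement step (illustrative sketch).

    Samples candidate actions around the current policy output, scores them
    with the critic, weights them by a softmax combining Q estimates with
    cosine similarity to the policy action in the learned embedding space,
    and regresses the policy toward the resulting weighted target action.
    """
    with torch.no_grad():
        a_pi = policy(state)                                       # (B, A)
        # Candidates: Gaussian perturbations of the policy action,
        # clamped assuming actions bounded in [-1, 1] as in MuJoCo.
        noise = noise_std * torch.randn(n_candidates, *a_pi.shape)
        cands = (a_pi.unsqueeze(0) + noise).clamp(-1.0, 1.0)       # (N, B, A)

        s_rep = state.unsqueeze(0).expand(n_candidates, *state.shape)
        q = critic(s_rep.flatten(0, 1), cands.flatten(0, 1))
        q = q.view(n_candidates, -1)                               # (N, B)

        # Cosine similarity between candidate and policy-action embeddings
        # (phi outputs unit-norm vectors, so the dot product is the cosine).
        z_pi = phi(state, a_pi).unsqueeze(0)                       # (1, B, D)
        z_c = phi(s_rep.flatten(0, 1), cands.flatten(0, 1))
        z_c = z_c.view(n_candidates, *z_pi.shape[1:])              # (N, B, D)
        kernel = (z_c * z_pi).sum(dim=-1)                          # (N, B)

        # Softmax weights favor high-value, geometry-consistent candidates;
        # the additive mixing of q and kernel is an assumed design choice.
        w = F.softmax(beta * q + kernel, dim=0).unsqueeze(-1)      # (N, B, 1)
        a_target = (w * cands).sum(dim=0)                          # (B, A)

    # Regress the policy toward the kernel-weighted target action.
    return F.mse_loss(policy(state), a_target)
```

Because the target action is computed without gradients, a step like this slots into a standard off-policy actor-critic loop in place of (or alongside) the usual local policy-gradient update, which is consistent with the scalability claim in the abstract.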