Efficient Hierarchical Implicit Flow Q-learning for Offline Goal-conditioned Reinforcement Learning

arXiv cs.LG / 4/13/2026


Key Points

  • The paper addresses long-horizon offline goal-conditioned reinforcement learning, highlighting that existing hierarchical methods (e.g., HIQL) struggle due to limited Gaussian-policy expressiveness and weak subgoal generation by high-level policies.
  • It proposes a goal-conditioned mean flow policy that models an average velocity field for both high-level and low-level components, enabling efficient one-step action sampling.
  • To improve goal representation quality, the authors add a LeJEPA loss that repels goal-embedding vectors during training, aiming to produce more discriminative representations and better generalization.
  • Experiments on the OGBench benchmark show the method delivers strong results on both state-based and pixel-based tasks, indicating broader applicability beyond low-dimensional environments.
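The one-step sampling idea from the second bullet can be illustrated with a toy average velocity field. This is a hedged sketch, not the paper's implementation: the closed-form field `avg_velocity` and target `MU` are illustrative stand-ins for the goal-conditioned network the paper would learn.

```python
import numpy as np

# Toy "learned" average velocity field u(z, r, t). For illustration we use a
# closed-form field that transports noise toward a fixed target mean MU;
# in the paper this field would be a goal-conditioned neural network.
MU = np.array([0.5, -1.0, 2.0])  # hypothetical target action (assumption)

def avg_velocity(z, r, t):
    """Average velocity over [r, t] for a straight-line flow from z to MU."""
    # Moving from z at time r to MU at time 1 along a straight line, the
    # average velocity over any sub-interval is constant: (MU - z) / (1 - r).
    return (MU - z) / (1.0 - r)

def one_step_sample(rng, dim=3):
    """One-step mean-flow sampling: a = z0 + (t - r) * u(z0, r, t), r=0, t=1."""
    z0 = rng.standard_normal(dim)  # start from Gaussian noise
    return z0 + (1.0 - 0.0) * avg_velocity(z0, 0.0, 1.0)

rng = np.random.default_rng(0)
action = one_step_sample(rng)
print(action)  # lands exactly on MU for this toy field
```

The point of the sketch: because the field encodes an *average* velocity over the whole interval, a single Euler-style step from noise reaches the target, replacing the many integration steps an instantaneous-velocity flow would need.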

Abstract

Offline goal-conditioned reinforcement learning (GCRL) is a practical reinforcement learning paradigm that aims to learn goal-conditioned policies from reward-free offline data. Despite recent advances in hierarchical architectures such as HIQL, long-horizon control in offline GCRL remains challenging due to the limited expressiveness of Gaussian policies and the inability of high-level policies to generate effective subgoals. To address these limitations, we propose the goal-conditioned mean flow policy, which introduces an average velocity field into hierarchical policy modeling for offline GCRL. Specifically, the mean flow policy captures complex target distributions for both high-level and low-level policies through a learned average velocity field, enabling efficient action generation via one-step sampling. Furthermore, to address insufficient goal representations, we introduce a LeJEPA loss that repels goal representation embeddings during training, thereby encouraging more discriminative representations and improving generalization. Experimental results show that our method achieves strong performance across both state-based and pixel-based tasks in the OGBench benchmark.
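The repulsion idea in the abstract can be illustrated with a generic uniformity-style pairwise loss over normalized embeddings. This is only a sketch of the general principle (embeddings that collapse together are penalized); the summary does not give the exact form of the paper's LeJEPA objective, so the loss below is an assumption, not the authors' method.

```python
import numpy as np

def repulsion_loss(embeddings, t=2.0):
    """Generic pairwise repulsion ("uniformity") loss: lower when normalized
    embeddings spread apart on the unit sphere. Illustrative sketch only;
    not the paper's exact LeJEPA objective."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Squared Euclidean distances between every pair of unit embeddings.
    sq_dists = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    n = len(z)
    mask = ~np.eye(n, dtype=bool)  # exclude self-pairs (distance zero)
    return np.log(np.mean(np.exp(-t * sq_dists[mask])))

collapsed = np.ones((8, 4))           # all goal embeddings identical
spread = np.eye(4).repeat(2, axis=0)  # embeddings spread over coordinate axes
print(repulsion_loss(collapsed), repulsion_loss(spread))
```

A collapsed embedding set scores 0 (every pairwise distance is zero), while a spread-out set scores strictly lower, so minimizing this term pushes goal embeddings apart, the qualitative effect the abstract attributes to its LeJEPA loss.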