Non-Adversarial Imitation Learning Provably Free of Compounding Errors: The Role of Bellman Constraints

arXiv cs.LG / 3/25/2026


Key Points

  • The paper argues that popular non-adversarial, Q-based imitation learning methods such as IQ-Learn provably reduce to behavioral cloning: they admit a lower bound on the imitation gap that scales quadratically with the horizon, so they still suffer from compounding errors.
  • It explains why IQ-Learn fails to generalize: it uniformly suppresses Q-values for all actions at states poorly covered by demonstrations, so it cannot recover expert behavior outside the demonstrated state distribution.
  • To fix this, the authors propose Dual Q-DM, a primal-dual distribution-matching framework that adds Bellman constraints to propagate value information from visited states to unvisited ones.
  • The paper claims Dual Q-DM is provably equivalent to adversarial imitation learning in a way that can recover expert actions beyond demonstrations and theoretically eliminate compounding errors.
  • The authors report experiments that corroborate the theoretical claims about generalization and compounding-error mitigation.
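
The value-propagation mechanism in the third bullet can be illustrated with a toy sketch. This is a hypothetical example, not the paper's algorithm: on a 5-state chain, a BC-style objective leaves Q-values flat on undemonstrated states, while Bellman backups propagate value from demonstrated states outward, so an off-distribution agent steers back toward the covered region.

```python
# Toy sketch (hypothetical, not Dual Q-DM itself): Bellman backups propagate
# Q-values from demonstrated states to undemonstrated ones on a 5-state chain.
import numpy as np

n_states, n_actions, gamma = 5, 2, 0.9   # actions: 0 = left, 1 = right
demo = {(0, 1), (1, 1), (2, 1)}          # expert moves right on states 0..2

def step(s, a):
    """Deterministic chain dynamics."""
    return min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)

# BC-style initialization: value only on demonstrated pairs, zero elsewhere.
Q = np.zeros((n_states, n_actions))
for s, a in demo:
    Q[s, a] = 1.0

# Off-distribution state 3: both actions tie, so no expert-like preference.
flat_before = Q[3, 0] == Q[3, 1]

# Bellman backups (treating demonstrated pairs as rewarding) propagate
# high Q-values from visited states to unvisited ones.
for _ in range(100):
    for s in range(n_states):
        for a in range(n_actions):
            r = 1.0 if (s, a) in demo else 0.0
            Q[s, a] = r + gamma * Q[step(s, a)].max()

print(flat_before, Q[3, 0] > Q[3, 1])  # True True: state 3 now prefers "left",
                                       # steering back toward demonstrated states
```

After the backups, the greedy action on the demonstrated states still matches the expert, while the uncovered state gains a preference that returns the agent to the expert's distribution, which is exactly the off-distribution recovery behavior that mitigates compounding errors.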

Abstract

Adversarial imitation learning (AIL) achieves high-quality imitation by mitigating compounding errors in behavioral cloning (BC), but often exhibits training instability due to adversarial optimization. To avoid this issue, a class of non-adversarial Q-based imitation learning (IL) methods, represented by IQ-Learn, has emerged and is widely believed to outperform BC by leveraging online environment interactions. However, this paper revisits IQ-Learn and demonstrates that it provably reduces to BC and suffers from an imitation gap lower bound with quadratic dependence on horizon, therefore still suffering from compounding errors. Theoretical analysis reveals that, despite using online interactions, IQ-Learn uniformly suppresses the Q-values for all actions on states uncovered by demonstrations, thereby failing to generalize. To address this limitation, we introduce a primal-dual framework for distribution matching, yielding a new Q-based IL method, Dual Q-DM. The key mechanism in Dual Q-DM is incorporating Bellman constraints to propagate high Q-values from visited states to unvisited ones, thereby achieving generalization beyond demonstrations. We prove that Dual Q-DM is equivalent to AIL and can recover expert actions beyond demonstrations, thereby mitigating compounding errors. To the best of our knowledge, Dual Q-DM is the first non-adversarial IL method that is theoretically guaranteed to eliminate compounding errors. Experimental results further corroborate our theoretical results.
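
The quadratic horizon dependence in the abstract follows the classical compounding-error arithmetic for behavioral cloning (Ross & Bagnell, 2010). A sketch of the standard argument, not the paper's exact bound: if the learned policy \(\pi\) disagrees with the expert \(\pi_E\) with probability at most \(\epsilon\) per step on the expert's state distribution, then once it drifts off-distribution it can accumulate maximal per-step loss, giving over horizon \(H\)

\[
  J(\pi_E) - J(\pi) \;\le\; \sum_{t=1}^{H} t\,\epsilon \;=\; O(\epsilon H^2),
\]

whereas a method that recovers expert actions on off-distribution states, as Dual Q-DM claims to, avoids the cascading term and can achieve an imitation gap that scales as \(O(\epsilon H)\).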