Fine-tuning is Not Enough: A Parallel Framework for Collaborative Imitation and Reinforcement Learning in End-to-end Autonomous Driving

arXiv cs.RO / 4/7/2026


Key Points

  • The proposed method, PaIR-Drive, is a framework that mitigates imitation learning's (IL) dependence on demonstration quality in end-to-end autonomous driving by jointly optimizing IL and reinforcement learning (RL) in a parallel architecture.
  • Whereas the conventional "IL followed by sequential RL fine-tuning" paradigm is prone to policy drift and performance ceilings, PaIR-Drive avoids these issues by separating IL and RL into two parallel branches trained jointly with conflict-free objectives.
  • At inference time, the RL branch references the IL policy to further optimize the final plan, aiming for performance beyond IL's prior knowledge.
  • The design also introduces a tree-structured trajectory sampler for group relative policy optimization (GRPO), strengthening exploration capability.
  • On the NAVSIM v1/v2 benchmarks, PaIR-Drive achieves competitive scores of 91.2 PDMS and 87.9 EPDMS on top of IL baselines such as Transfuser and DiffusionDrive, consistently outperforms existing RL fine-tuning methods, and, as shown by analysis and qualitative results, can even correct suboptimal behaviors of human experts.
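The tree-structured trajectory sampler mentioned above can be illustrated with a minimal sketch. This is a hypothetical enumerative version for intuition only, not the paper's learned neural sampler: each planning step branches into several candidate actions (here, simple 1-D lateral offsets), so candidates form a tree whose leaves are full trajectories.

```python
import itertools

def tree_trajectories(start, branch_actions, depth):
    """Enumerate candidate trajectories as a tree: from each node,
    branch into one child per action at every step.
    Hypothetical illustration, not PaIR-Drive's actual sampler."""
    trajs = []
    for choices in itertools.product(branch_actions, repeat=depth):
        pos = start
        traj = [pos]
        for a in choices:
            pos = pos + a  # apply the branch action at this tree level
            traj.append(pos)
        trajs.append(traj)
    return trajs

# 3 branches per step at depth 2 yields 3**2 = 9 candidate trajectories.
candidates = tree_trajectories(0.0, [-1.0, 0.0, 1.0], 2)
print(len(candidates))  # → 9
```

In the paper's setting, a neural sampler would score and prune such branches rather than enumerate them exhaustively; the tree structure is what gives the RL branch a diverse group of candidates to compare.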

Abstract

End-to-end autonomous driving is typically built upon imitation learning (IL), yet its performance is constrained by the quality of human demonstrations. To overcome this limitation, recent methods incorporate reinforcement learning (RL) through sequential fine-tuning. However, such a paradigm remains suboptimal: sequential RL fine-tuning can introduce policy drift and often leads to a performance ceiling due to its dependence on the pretrained IL policy. To address these issues, we propose PaIR-Drive, a general Parallel framework for collaborative Imitation and Reinforcement learning in end-to-end autonomous driving. During training, PaIR-Drive separates IL and RL into two parallel branches with conflict-free training objectives, enabling fully collaborative optimization. This design eliminates the need to retrain RL when applying a new IL policy. During inference, RL leverages the IL policy to further optimize the final plan, allowing performance beyond the prior knowledge of IL. Furthermore, we introduce a tree-structured trajectory neural sampler for group relative policy optimization (GRPO) in the RL branch, which enhances exploration capability. Extensive analysis on the NAVSIM v1 and v2 benchmarks demonstrates that PaIR-Drive achieves competitive performance of 91.2 PDMS and 87.9 EPDMS, building upon the Transfuser and DiffusionDrive IL baselines. PaIR-Drive consistently outperforms existing RL fine-tuning methods, and can even correct human experts' suboptimal behaviors. Qualitative results further confirm that PaIR-Drive can effectively explore and generate high-quality trajectories.
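The abstract applies GRPO in the RL branch. As a rough sketch of the group-relative idea behind GRPO (not the paper's implementation), each sampled trajectory's reward is normalized against the mean and standard deviation of its own sampling group, so no learned value critic is needed:

```python
import math

def grpo_advantages(rewards):
    """GRPO-style group-relative advantages: standardize each reward
    against its sampling group's mean and std (eps avoids div-by-zero).
    Illustrative sketch only; reward design here is hypothetical."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) + 1e-8
    return [(r - mean) / std for r in rewards]

# One group of sampled trajectories, scored by some driving reward
# (e.g., a PDMS-like metric); higher is better.
scores = [0.91, 0.85, 0.78, 0.95]
advantages = grpo_advantages(scores)
print([round(a, 3) for a in advantages])
```

Trajectories scoring above the group mean receive positive advantages (and are reinforced), those below receive negative ones; the tree-structured sampler's role is to supply a diverse group over which this comparison is meaningful.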
