AutoDrive-P^3: Unified Chain of Perception-Prediction-Planning Thought via Reinforcement Fine-Tuning

arXiv cs.RO / 3/31/2026


Key Points

  • AutoDrive-P^3 is proposed as a framework that unifies perception, prediction, and planning in a single reasoning process, addressing two problems with using VLMs for end-to-end autonomous driving: the domain gap caused by skipping chain-of-thought (CoT) reasoning, and the lack of synergy caused by fragmented, independently operating modules.
  • Perception, prediction, and planning are linked through a structured P^3-CoT, making the information dependencies of perception → prediction → planning explicit while letting both perception and prediction contribute to the final planning decision.
  • P^3-GRPO, a hierarchical reinforcement learning algorithm, provides progressive supervision across all three tasks, training CoT reasoning and answer generation stage by stage.
  • To balance inference efficiency and performance, two thinking modes are introduced: "detailed thinking" and "fast thinking". The paper reports state-of-the-art planning performance on both open-loop (nuScenes) and closed-loop (NAVSIMv1/v2) benchmarks.
  • The code is released on GitHub, enabling research use and reproduction.
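
The chained perception → prediction → planning structure described above can be sketched as a data layout in which each later stage conditions on all earlier reasoning and answers. The serialization format, field names, and example contents below are hypothetical illustrations, not the paper's actual P^3-CoT schema:

```python
from dataclasses import dataclass

@dataclass
class P3Step:
    """One stage of the perception -> prediction -> planning chain."""
    name: str       # stage name
    reasoning: str  # chain-of-thought text for this stage
    answer: str     # the stage's structured answer

def build_p3_prompt(steps):
    """Serialize the stages in order, so each later stage can condition on
    every earlier stage's reasoning and answer (illustrative format)."""
    parts = []
    for step in steps:
        parts.append(f"[{step.name}] reasoning: {step.reasoning}")
        parts.append(f"[{step.name}] answer: {step.answer}")
    return "\n".join(parts)

# Hypothetical example: perception informs prediction, and both feed planning.
chain = [
    P3Step("perception", "two pedestrians near the crosswalk", "objects: pedestrian x2"),
    P3Step("prediction", "pedestrians are likely to cross", "trajectories: crossing"),
    P3Step("planning", "yield until the crosswalk is clear", "action: decelerate"),
]
print(build_p3_prompt(chain))
```

The key design point is the ordering: planning never appears before the perception and prediction context it depends on, which is what distinguishes this unified chain from fragmented per-task outputs.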

Abstract

Vision-language models (VLMs) are increasingly being adopted for end-to-end autonomous driving systems due to their exceptional performance in handling long-tail scenarios. However, current VLM-based approaches suffer from two major limitations: 1) Some VLMs directly output planning results without chain-of-thought (CoT) reasoning, bypassing crucial perception and prediction stages, which creates a significant domain gap and compromises decision-making capability; 2) Other VLMs can generate outputs for perception, prediction, and planning tasks but employ a fragmented decision-making approach where these modules operate separately, leading to a significant lack of synergy that undermines true planning performance. To address these limitations, we propose AutoDrive-P^3, a novel framework that seamlessly integrates **P**erception, **P**rediction, and **P**lanning through structured reasoning. We introduce the P^3-CoT dataset to facilitate coherent reasoning and propose P^3-GRPO, a hierarchical reinforcement learning algorithm that provides progressive supervision across all three tasks. Specifically, AutoDrive-P^3 progressively generates CoT reasoning and answers for perception, prediction, and planning, where perception provides essential information for subsequent prediction and planning, while both perception and prediction collectively contribute to the final planning decisions, enabling safer and more interpretable autonomous driving. Additionally, to balance inference efficiency with performance, we introduce dual thinking modes: detailed thinking and fast thinking. Extensive experiments on both open-loop (nuScenes) and closed-loop (NAVSIMv1/v2) benchmarks demonstrate that our approach achieves state-of-the-art performance in planning tasks. Code is available at https://github.com/haha-yuki-haha/AutoDrive-P3.
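
The progressive supervision idea behind P^3-GRPO can be illustrated with a minimal sketch: each stage contributes a weighted reward so that credit flows to perception and prediction rather than only to the final plan, and advantages are computed group-relatively, as in standard GRPO. The stage weights, correctness flags, and reward shape below are hypothetical assumptions for illustration, not the paper's actual reward design:

```python
def stage_rewards(perception_ok, prediction_ok, planning_ok,
                  weights=(0.2, 0.3, 0.5)):
    """Progressive supervision (illustrative): each of the three stages
    contributes a weighted reward term. The weights are hypothetical."""
    scores = [float(perception_ok), float(prediction_ok), float(planning_ok)]
    return sum(w * s for w, s in zip(weights, scores))

def group_relative_advantages(rewards):
    """GRPO-style advantage: normalize each sampled response's reward by
    the mean and standard deviation of its sampling group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Hypothetical group of 4 sampled responses with different stage outcomes:
group = [
    stage_rewards(True, True, True),    # all three stages correct
    stage_rewards(True, False, False),  # only perception correct
    stage_rewards(True, True, False),   # perception + prediction correct
    stage_rewards(False, False, False), # nothing correct
]
print([round(a, 3) for a in group_relative_advantages(group)])
```

Under this sketch, a response that gets perception and prediction right but the plan wrong still earns partial reward, which is the point of supervising all three tasks rather than the final plan alone.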