On-Policy Distillation of Language Models for Autonomous Vehicle Motion Planning

arXiv cs.RO / 4/10/2026


Key Points

  • The paper addresses how to deploy LLM-based autonomous vehicle motion planners on resource-constrained onboard systems by distilling knowledge from a large teacher model to a smaller student model.
  • It builds on GPT-Driver, which represents driving scenes as language prompts and generates waypoint trajectories with chain-of-thought reasoning.
  • Two student training approaches are compared: on-policy generalized knowledge distillation (GKD) using dense token-level feedback from the teacher on the student’s own outputs, and a dense-feedback reinforcement learning (RL) baseline using teacher log-probabilities as per-token reward signals.
  • Experiments on the nuScenes benchmark show that on-policy GKD significantly outperforms the RL baseline and achieves near-teacher-level performance with a model about 5× smaller.
  • The authors conclude that on-policy distillation is a principled and effective method for making LLM-based planners practical for real autonomous driving deployments.
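The on-policy GKD objective in the third bullet (dense token-level feedback from the teacher on the student's own outputs) can be sketched as a per-token divergence between teacher and student next-token distributions along a student-sampled sequence. This is a minimal illustration assuming a forward-KL loss; the paper's exact divergence choice (GKD formulations often use a generalized Jensen-Shannon divergence) is not stated here, and the function names are hypothetical.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def gkd_token_loss(student_logits, teacher_logits):
    """Dense distillation loss on a student-generated sequence.

    Both inputs have shape (seq_len, vocab): the logits each model
    assigns at every position of the student's own rollout.  The loss
    is the per-token KL(teacher || student), averaged over positions,
    so the student gets a learning signal at every token rather than
    a single sequence-level score.
    """
    p_t = softmax(teacher_logits)
    log_p_t = np.log(p_t)
    log_p_s = np.log(softmax(student_logits))
    kl_per_token = (p_t * (log_p_t - log_p_s)).sum(axis=-1)
    return kl_per_token.mean()
```

The "on-policy" part is that `student_logits` and `teacher_logits` are both evaluated on sequences the student itself sampled, so the teacher corrects the student exactly where the student's own generations go wrong.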

Abstract

Large language models (LLMs) have recently demonstrated strong potential for autonomous vehicle motion planning by reformulating trajectory prediction as a language generation problem. However, deploying capable LLMs in resource-constrained onboard systems remains a fundamental challenge. In this paper, we study how to effectively transfer motion planning knowledge from a large teacher LLM to a smaller, more deployable student model. We build on the GPT-Driver framework, which represents driving scenes as language prompts and generates waypoint trajectories with chain-of-thought reasoning, and investigate two student training paradigms: (i) on-policy generalized knowledge distillation (GKD), which trains the student on its own self-generated outputs using dense token-level feedback from the teacher, and (ii) a dense-feedback reinforcement learning (RL) baseline that uses the teacher's log-probabilities as per-token reward signals in a policy gradient framework. Experiments on the nuScenes benchmark show that GKD substantially outperforms the RL baseline and closely approaches teacher-level performance despite a 5× reduction in model size. These results highlight the practical value of on-policy distillation as a principled and effective approach to deploying LLM-based planners in autonomous driving systems.
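The RL baseline described in the abstract scores each token the student actually emitted by the teacher's log-probability of that token, giving a dense per-token reward for a policy-gradient update. The sketch below shows only the reward computation, under the assumption that the reward is the raw teacher log-probability (any baseline subtraction or discounting used in the paper is omitted); the function names are illustrative.

```python
import numpy as np

def log_softmax(logits):
    # Stable log-softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def dense_rl_rewards(teacher_logits, sampled_tokens):
    """Per-token rewards for the policy-gradient baseline.

    teacher_logits: (seq_len, vocab) teacher logits at each position
                    of the student's sampled sequence.
    sampled_tokens: (seq_len,) token ids the student actually emitted.

    Returns r_t = log p_teacher(y_t | context): near zero when the
    teacher would likely have produced the same token, strongly
    negative when the student's choice is improbable under the teacher.
    """
    lp = log_softmax(teacher_logits)
    return lp[np.arange(len(sampled_tokens)), sampled_tokens]
```

Unlike the distillation loss, this reward sees only the one token the student sampled, not the teacher's full distribution at each step, which is one intuition for why the paper finds GKD's richer token-level feedback more effective.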