Curriculum-based Sample Efficient Reinforcement Learning for Robust Stabilization of a Quadrotor

arXiv cs.RO / 4/14/2026


Key Points

  • The paper proposes a sample-efficient, curriculum learning (CL) method to train an end-to-end reinforcement learning policy for robust quadrotor stabilization that controls motor RPMs directly.
  • It targets simultaneous position and yaw-orientation stabilization from random initial conditions while satisfying predefined transient and steady-state performance specifications.
  • To overcome the slow, compute-intensive training of conventional one-stage end-to-end RL, the authors decompose the task into a three-stage curriculum (hovering, translational-rotational coupling, and robustness to random non-zero initial velocities) with knowledge transfer across stages.
  • The training uses a custom reward function and episode truncation conditions, and the CL-trained policy shows improved performance and robustness versus one-stage training under the same reward/hyperparameters.
  • Validation is performed in simulation (Gym-PyBullet-Drones) and in an inspection pose-tracking scenario, demonstrating reduced sample/computation needs and faster convergence, with results supported by an accompanying video.
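The staged training described in the points above can be sketched as a simple curriculum driver that warm-starts each stage from the previous one. Everything here (the stage names, the `train_stage` placeholder, and its parameters) is a hypothetical illustration of the knowledge-transfer pattern, not the paper's implementation:

```python
import random

# Illustrative three-stage curriculum mirroring the paper's decomposition:
# 1) hovering, 2) translational-rotational coupling, 3) robustness to random
# non-zero initial velocities. Field names and values are assumptions.
STAGES = [
    {"name": "hovering",          "init_vel_range": 0.0, "couple_axes": False},
    {"name": "coupling",          "init_vel_range": 0.0, "couple_axes": True},
    {"name": "random_velocities", "init_vel_range": 1.0, "couple_axes": True},
]

def train_stage(policy_weights, stage, episodes=100):
    """Placeholder for one RL training stage; a real implementation would run
    policy-gradient rollouts in the simulator. Here we only perturb the
    weights to show the pattern: each stage starts from the previous one."""
    w = dict(policy_weights)  # warm-start (knowledge transfer) from prior stage
    for _ in range(episodes):
        w["theta"] += 0.01 * random.uniform(-1, 1)
    w["trained_on"] = w.get("trained_on", []) + [stage["name"]]
    return w

weights = {"theta": 0.0}  # randomly initialized policy for the first stage
for stage in STAGES:
    weights = train_stage(weights, stage)

print(weights["trained_on"])  # stages seen, in curriculum order
```

The key design choice this sketch highlights is that only the task distribution changes between stages; the policy parameters flow through unchanged, which is what lets later stages converge with fewer samples.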

Abstract

This article introduces a novel sample-efficient curriculum learning (CL) approach for training an end-to-end reinforcement learning (RL) policy for robust stabilization of a quadrotor. The learning objective is to simultaneously stabilize position and yaw-orientation from random initial conditions through direct control over motor RPMs (end-to-end), while adhering to pre-specified transient and steady-state specifications. This objective, relevant in aerial inspection applications, is challenging for conventional one-stage end-to-end RL, which requires substantial computational resources and lengthy training times. To address this challenge, this article adopts a human-inspired curriculum learning approach and decomposes the learning objective into a three-stage curriculum that incrementally increases task complexity while transferring knowledge from one stage to the next. In the proposed curriculum, the policy sequentially learns hovering, the coupling between translational and rotational degrees of freedom, and robustness to random non-zero initial velocities, using a custom reward function and episode truncation conditions. The results demonstrate that the proposed CL approach achieves superior performance compared to a policy trained conventionally in one stage with the same reward function and hyperparameters, while significantly reducing computational resource needs (samples) and convergence time. The CL-trained policy's performance and robustness are thoroughly validated in a simulation engine (Gym-PyBullet-Drones) under random initial conditions and in an inspection pose-tracking scenario. A video presenting our results is available at https://youtu.be/9wv6T4eezAU.
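As an illustration of the episode-truncation idea mentioned in the abstract, a minimal check might end an episode once the quadrotor's state leaves a recoverable envelope, so training samples are not wasted on hopeless states. The bounds and function below are assumptions for the sketch, not the paper's values:

```python
import math

# Hypothetical truncation envelope (illustrative values, not from the paper):
POS_LIMIT_M = 2.0                   # max allowed distance from target position
TILT_LIMIT_RAD = math.radians(60)   # max allowed roll/pitch magnitude

def should_truncate(pos_err, roll, pitch):
    """Return True if the state is outside the assumed training envelope.
    pos_err is the (x, y, z) position error in meters; roll/pitch in radians."""
    dist = math.sqrt(sum(e * e for e in pos_err))
    return (dist > POS_LIMIT_M
            or abs(roll) > TILT_LIMIT_RAD
            or abs(pitch) > TILT_LIMIT_RAD)

print(should_truncate((0.1, 0.0, 0.2), 0.1, -0.2))  # → False (near target, level)
print(should_truncate((3.0, 0.0, 0.0), 0.0, 0.0))   # → True (outside envelope)
```

In Gymnasium-style environments this check would feed the `truncated` flag returned by `env.step`, distinct from `terminated`, so the learner can bootstrap correctly at the cutoff.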