GaussFly: Contrastive Reinforcement Learning for Visuomotor Policies in 3D Gaussian Fields

arXiv cs.RO / 4/8/2026


Key Points

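  • GaussFly addresses the problem of learning visuomotor policies for Autonomous Aerial Vehicles (AAVs) from monocular vision alone, and proposes a framework that decouples representation learning from policy optimization.
  • To carry real environments into simulation with high fidelity, it adopts a "real-to-sim-to-real" paradigm in which training scenes are reconstructed using 3D Gaussian Splatting (3DGS) augmented with geometric constraints.
  • It then applies contrastive representation learning in the photorealistic rendered environments, extracting compact, noise-resilient latent features that are fed to the policy. This reduces the policy's computational load while improving its robustness.
  • In simulated and real-world experiments, the authors report better sample efficiency and asymptotic performance than existing methods, along with robust zero-shot transfer to unseen environments with complex textures.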

Abstract

Learning visuomotor policies for Autonomous Aerial Vehicles (AAVs) relying solely on monocular vision is an attractive yet highly challenging paradigm. Existing end-to-end learning approaches directly map high-dimensional RGB observations to action commands, which frequently suffer from low sample efficiency and severe sim-to-real gaps due to the visual discrepancy between simulation and physical domains. To address these long-standing challenges, we propose GaussFly, a novel framework that explicitly decouples representation learning from policy optimization through a cohesive real-to-sim-to-real paradigm. First, to achieve a high-fidelity real-to-sim transition, we reconstruct training scenes using 3D Gaussian Splatting (3DGS) augmented with explicit geometric constraints. Second, to ensure robust sim-to-real transfer, we leverage these photorealistic simulated environments and employ contrastive representation learning to extract compact, noise-resilient latent features from the rendered RGB images. By utilizing this pre-trained encoder to provide low-dimensional feature inputs, the computational burden on the visuomotor policy is significantly reduced while its resistance against visual noise is inherently enhanced. Extensive experiments in simulated and real-world environments demonstrate that GaussFly achieves superior sample efficiency and asymptotic performance compared to baselines. Crucially, it enables robust and zero-shot policy transfer to unseen real-world environments with complex textures, effectively bridging the sim-to-real gap.
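To make the contrastive step more concrete, here is a minimal sketch of an InfoNCE-style loss, one common contrastive objective. The abstract does not say which contrastive loss GaussFly uses, so the choice of InfoNCE, the function names `l2_normalize` and `info_nce_loss`, and the `temperature` value are all assumptions made for illustration. The sketch treats `anchors[i]` and `positives[i]` as embeddings of two views of the same rendered frame, and uses the other entries in the batch as negatives.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def info_nce_loss(anchors, positives, temperature=0.1):
    """InfoNCE-style loss over a batch of embedding pairs.

    Hypothetical helper, not taken from the paper: anchors[i] should match
    positives[i]; every other positives[j] in the batch acts as a negative.
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity of anchor i to every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching index i as the target class.
        total += log_denom - logits[i]
    return total / len(a)


# Toy 2-D embeddings: correctly paired views versus swapped pairs.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = info_nce_loss(anchors, aligned)
loss_swapped = info_nce_loss(anchors, swapped)
```

In this toy case the loss is near zero when each anchor is closest to its own positive and large when the pairs are swapped. In a decoupled setup like the one the abstract describes, an encoder pre-trained with such an objective would supply low-dimensional features to the policy in place of raw RGB images.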