P3T: Prototypical Point-level Prompt Tuning with Enhanced Generalization for 3D Vision-Language Models

arXiv cs.CV / 4/20/2026


Key Points

  • The paper introduces Prototypical Point-level Prompt Tuning (P3T), a parameter-efficient way to adapt pre-trained 3D vision-language models to downstream point-cloud tasks without the cost of full fine-tuning.
  • P3T uses two learned components—a Point Prompter that creates instance-aware point-level prompts from the input cloud and a Text Prompter that injects learnable prompts into the text—reducing reliance on handcrafted prompts.
  • To improve generalization, the method adds a prototypical loss aimed at aligning embedding spaces by lowering intra-category variance.
  • Experiments on classification and few-shot learning show performance on par with or better than full fine-tuning, including strong robustness to data shifts across datasets.
  • The authors release code on GitHub, enabling others to reproduce and build on the approach for 3D VLM adaptation.
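The core idea behind prompt tuning, as described above, is to freeze the pre-trained backbone and learn only a small set of prompt parameters prepended to the input. The sketch below illustrates that mechanism in PyTorch; the class name, shapes, and initialization are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class PromptTunedEncoder(nn.Module):
    """Minimal prompt-tuning sketch (hypothetical design): a frozen
    encoder with a small set of learnable prompt tokens prepended
    to its token sequence, so only the prompts receive gradients."""

    def __init__(self, encoder: nn.Module, num_prompts: int, dim: int):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False  # backbone stays frozen
        # Learnable prompt tokens, one (num_prompts, dim) block shared
        # across the batch; small-std init is a common convention.
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim)
        b = tokens.shape[0]
        prompts = self.prompts.unsqueeze(0).expand(b, -1, -1)
        # Prepend prompts to the input sequence, then run the
        # frozen encoder over the augmented sequence.
        return self.encoder(torch.cat([prompts, tokens], dim=1))
```

With, say, 4 prompt tokens of dimension 512, the trainable parameter count is 4 × 512 regardless of backbone size, which is what makes the approach parameter-efficient.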

Abstract

With the rise of pre-trained models in the 3D point cloud domain for a wide range of real-world applications, adapting them to downstream tasks has become increasingly important. However, conventional full fine-tuning methods are computationally expensive and storage-intensive. Although prompt tuning has emerged as an efficient alternative, it often suffers from overfitting, thereby compromising generalization capability. To address this issue, we propose Prototypical Point-level Prompt Tuning (P3T), a parameter-efficient prompt tuning method designed for pre-trained 3D vision-language models (VLMs). P3T consists of two components: 1) *Point Prompter*, which generates instance-aware point-level prompts for the input point cloud, and 2) *Text Prompter*, which injects learnable prompts into the input text in place of hand-crafted ones. Since both prompters operate directly on input data, P3T enables task-specific adaptation of 3D VLMs without sacrificing generalizability. Furthermore, to enhance embedding space alignment, which is key to fine-tuning 3D VLMs, we introduce a prototypical loss that reduces intra-category variance. Extensive experiments demonstrate that our method matches or outperforms full fine-tuning in classification and few-shot learning, and further exhibits robust generalization under data shift in the cross-dataset setting. The code is available at https://github.com/gyjung975/P3T.
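The prototypical loss described in the abstract pulls each embedding toward the mean embedding (prototype) of its category, directly penalizing intra-category variance. A minimal sketch of that idea is below; the function name and the mean-squared-distance formulation are plausible assumptions, since the abstract does not give the exact objective.

```python
import torch

def prototypical_loss(features: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Sketch of an intra-category variance penalty (assumed form):
    for each class present in the batch, compute its prototype as the
    mean embedding, then average the squared distance of each member
    embedding to that prototype."""
    classes = labels.unique()
    loss = features.new_zeros(())
    for c in classes:
        members = features[labels == c]          # (n_c, dim)
        proto = members.mean(dim=0)              # class prototype
        # Mean squared distance of members to their prototype.
        loss = loss + ((members - proto) ** 2).sum(dim=1).mean()
    return loss / classes.numel()
```

When all embeddings of a class coincide, the loss for that class is exactly zero, so minimizing it collapses each category toward a single point in embedding space, which is the alignment effect the abstract attributes to the loss.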