An Adapter-free Fine-tuning Approach for Tuning 3D Foundation Models

arXiv cs.CV / 3/26/2026


Key Points

  • The paper introduces Momentum-Consistency Fine-Tuning (MCFT), an adapter-free fine-tuning method for 3D point cloud foundation models to improve adaptation in low-data (few-shot) regimes.
  • MCFT fine-tunes only part of the pre-trained encoder while applying a momentum-based consistency constraint to reduce representation drift and overfitting compared with full fine-tuning.
  • The approach keeps the original model parameter count and does not add new trainable components beyond a standard task head, avoiding the inference-time latency costs common in adapter-based PEFT.
  • Two extensions are proposed: a semi-supervised variant that leverages unlabeled data for stronger few-shot performance and a pruning-based variant that increases computational efficiency via structured layer removal.
  • Experiments on object recognition and part segmentation benchmarks show consistent gains (e.g., +3.30% in 5-shot and up to +6.13% with semi-supervised learning) while remaining practical for resource-constrained deployment.
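The momentum-based consistency constraint described above is not detailed in this summary, but it resembles a Mean Teacher-style setup: a momentum (EMA) copy of the encoder provides stable target features, and the fine-tuned student is penalized for drifting away from them. The sketch below illustrates that mechanism only; the function names, momentum coefficient, and loss form are assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of a momentum-consistency constraint (Mean Teacher-style).
# The momentum value (0.9) and the MSE consistency loss are illustrative assumptions.

def ema_update(teacher, student, momentum=0.999):
    """Update teacher parameters as an exponential moving average of the student's."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]

def consistency_loss(student_feats, teacher_feats):
    """Mean squared distance between student features and momentum-teacher features."""
    n = len(student_feats)
    return sum((s - t) ** 2 for s, t in zip(student_feats, teacher_feats)) / n

# Toy usage with 3-dimensional "parameters"/"features":
student = [0.5, -1.2, 3.0]
teacher = [0.4, -1.0, 2.8]
teacher = ema_update(teacher, student, momentum=0.9)
loss = consistency_loss(student, teacher)
```

Because the teacher is just an averaged copy of the student, this constraint adds no new representation-learning parameters at inference time, which matches the adapter-free claim in the key points.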

Abstract

Point cloud foundation models demonstrate strong generalization, yet adapting them to downstream tasks remains challenging in low-data regimes. Full fine-tuning often leads to overfitting and significant drift from pre-trained representations, while existing parameter-efficient fine-tuning (PEFT) methods mitigate this issue by introducing additional trainable components at the cost of increased inference-time latency. We propose Momentum-Consistency Fine-Tuning (MCFT), an adapter-free approach that bridges the gap between full and parameter-efficient fine-tuning. MCFT selectively fine-tunes a portion of the pre-trained encoder while enforcing a momentum-based consistency constraint to preserve task-agnostic representations. Unlike PEFT methods, MCFT introduces no additional representation learning parameters beyond a standard task head, maintaining the original model's parameter count and inference efficiency. We further extend MCFT with two variants: a semi-supervised framework that leverages abundant unlabeled data to enhance few-shot performance, and a pruning-based variant that improves computational efficiency through structured layer removal. Extensive experiments on object recognition and part segmentation benchmarks demonstrate that MCFT consistently outperforms prior methods, achieving a 3.30% gain in 5-shot settings and up to a 6.13% improvement with semi-supervised learning, while remaining well-suited for resource-constrained deployment.
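The abstract's "selectively fine-tunes a portion of the pre-trained encoder" can be pictured as freezing early encoder blocks and updating only the last few blocks plus the task head. The helper below is a minimal sketch of that selection step; the layer names, the cutoff `tune_last_k`, and the head name are all hypothetical, since the summary does not specify which layers MCFT tunes.

```python
# Hypothetical sketch of selective (partial) fine-tuning: only the last k encoder
# blocks and the task head receive gradient updates; earlier blocks stay frozen.
# Layer names and the k=2 cutoff are illustrative assumptions.

def trainable_param_names(layer_names, tune_last_k=2, head_name="task_head"):
    """Return the subset of layers to fine-tune, preserving their original order."""
    encoder = [n for n in layer_names if n != head_name]
    tuned = set(encoder[-tune_last_k:]) | {head_name}
    return [n for n in layer_names if n in tuned]

layers = ["block1", "block2", "block3", "block4", "task_head"]
print(trainable_param_names(layers, tune_last_k=2))
# → ['block3', 'block4', 'task_head']
```

In a real training loop, every parameter outside this returned set would have gradients disabled, which is how partial fine-tuning limits drift from the pre-trained representation without adding adapter modules.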