3DAlign-DAER: Dynamic Attention Policy and Efficient Retrieval Strategy for Fine-grained 3D-Text Alignment at Scale

arXiv cs.CV / 4/27/2026


Key Points

  • The paper proposes 3DAlign-DAER, a unified framework for fine-grained text-to-3D geometry alignment that addresses poor semantic-geometric matching and performance collapse on large-scale 3D databases.
  • It introduces a Dynamic Attention Policy (DAP) that uses a Hierarchical Attention Fusion (HAF) module to learn token-to-point attentions, further calibrated with Monte Carlo tree search and a hybrid reward signal.
  • For inference on large datasets, 3DAlign-DAER adds an Efficient Retrieval Strategy (ERS) that performs hierarchical search in embedding spaces, improving accuracy and efficiency over approaches like KNN.
  • To support training and further research, the authors build Align3D-2M, a dataset of 2 million text–3D pairs, and report superior results across multiple benchmarks.
  • The authors plan to release code, models, and datasets to support further work on text–3D alignment.
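
To make the idea of token-to-point attention in the key points concrete, here is a minimal sketch in plain Python. It is not the paper's HAF module or its MCTS calibration: it assumes text tokens and 3D points are already embedded in a shared space, and it uses generic scaled dot-product attention. The function names (`token_to_point_attention`, `alignment_score`) are hypothetical and chosen only for illustration.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def token_to_point_attention(tokens, points):
    """Return one attention distribution over points for each text token.

    Assumption: tokens and points are lists of equal-length float vectors
    that live in the same embedding space.
    """
    scale = 1.0 / math.sqrt(len(tokens[0]))
    return [softmax([dot(t, p) * scale for p in points]) for t in tokens]

def alignment_score(tokens, points):
    """Average attention-weighted token-point similarity (illustrative only)."""
    attn = token_to_point_attention(tokens, points)
    per_token = [
        sum(w * dot(t, p) for w, p in zip(weights, points))
        for t, weights in zip(tokens, attn)
    ]
    return sum(per_token) / len(per_token)
```

Each row of the attention matrix sums to 1, so a token that matches one region of the shape concentrates its weight on the points of that region. A learned method would presumably train the embeddings and reweight these attentions, which this sketch does not attempt.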

Abstract

Despite recent advances in 3D-text cross-modal alignment, existing state-of-the-art methods still struggle to align fine-grained textual semantics with detailed geometric structures, and their alignment performance degrades significantly when scaling to large 3D databases. To overcome these limitations, we introduce 3DAlign-DAER, a unified framework that aligns text and 3D geometry via a proposed dynamic attention policy and an efficient retrieval strategy, capturing subtle correspondences for diverse cross-modal retrieval and classification tasks. Specifically, during training, the dynamic attention policy (DAP) employs a Hierarchical Attention Fusion (HAF) module to represent the alignment as learnable fine-grained token-to-point attentions. To optimize these attentions across different tasks and geometric hierarchies, DAP further uses Monte Carlo tree search to dynamically calibrate the HAF attention weights via a hybrid reward signal, which further enhances the alignment between textual descriptions and local 3D geometry. During inference, 3DAlign-DAER introduces an Efficient Retrieval Strategy (ERS) that performs efficient hierarchical search in large-scale embedding spaces, outperforming traditional methods (e.g., KNN) in both accuracy and efficiency. Furthermore, to facilitate text-3D alignment research and to train 3DAlign-DAER, we construct Align3D-2M, a large-scale dataset of 2M text-3D pairs that provides sufficient fine-grained cross-modal annotations. Extensive experiments demonstrate the superior performance of 3DAlign-DAER on diverse benchmarks. We will release our code, models, and datasets. Our code and updates are available at https://github.com/waltstephen/Cost-Effective-Communication.
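
The abstract does not spell out how ERS organizes its hierarchical search, so the following is only a generic coarse-to-fine sketch of why searching a hierarchy can be cheaper than exhaustive nearest-neighbor search. It assumes precomputed cluster centroids and cosine similarity. The helper names (`build_index`, `hierarchical_search`, `brute_force_search`) and the `n_probe` parameter are hypothetical and are not taken from the paper or its repository.

```python
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def build_index(embeddings, centroids):
    """Coarse level: group item ids under their most similar centroid."""
    buckets = {c: [] for c in range(len(centroids))}
    for item_id, emb in enumerate(embeddings):
        best = max(range(len(centroids)), key=lambda c: cosine(emb, centroids[c]))
        buckets[best].append(item_id)
    return buckets

def hierarchical_search(query, embeddings, centroids, buckets, n_probe=1, k=1):
    """Rank centroids first, then rank only items in the top n_probe buckets."""
    ranked = sorted(range(len(centroids)),
                    key=lambda c: cosine(query, centroids[c]), reverse=True)
    candidates = [i for c in ranked[:n_probe] for i in buckets[c]]
    candidates.sort(key=lambda i: cosine(query, embeddings[i]), reverse=True)
    return candidates[:k]

def brute_force_search(query, embeddings, k=1):
    """Exhaustive baseline: score every item against the query."""
    ids = sorted(range(len(embeddings)),
                 key=lambda i: cosine(query, embeddings[i]), reverse=True)
    return ids[:k]
```

With `n_probe` set to the number of centroids, the two-level search scores every item and returns the same result as the exhaustive baseline. Smaller values trade some recall for fewer similarity computations, which is the usual motivation for hierarchical indexes at the scale of millions of items.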