3DAlign-DAER: Dynamic Attention Policy and Efficient Retrieval Strategy for Fine-grained 3D-Text Alignment at Scale

arXiv cs.CV / 4/27/2026

Key Points

The paper proposes 3DAlign-DAER, a unified framework for fine-grained text-to-3D geometry alignment. It targets two weaknesses of existing methods: poor matching between fine-grained text semantics and detailed geometry, and significant performance degradation on large-scale 3D databases.
It introduces a Dynamic Attention Policy (DAP) that uses a Hierarchical Attention Fusion (HAF) module to learn fine-grained token-to-point attention weights, which are then calibrated with Monte Carlo tree search driven by a hybrid reward signal.
For inference on large datasets, 3DAlign-DAER adds an Efficient Retrieval Strategy (ERS) that performs hierarchical search in embedding spaces, improving accuracy and efficiency over approaches like KNN.
To enable training and research, the authors build Align3D-2M with 2 million text–3D pairs, and report that extensive experiments show superior results across multiple benchmarks.
The authors plan to release code, models, and datasets to support further work on text–3D alignment.
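
To make the idea of token-to-point attention concrete, here is a minimal, dependency-free sketch of plain scaled dot-product cross-attention between text tokens and 3D points. It is not the paper's HAF module or its Monte Carlo tree search calibration, whose details are not given in this summary. The function names, the toy feature vectors, and the assumption that tokens and points share one embedding dimension are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def token_to_point_attention(token_feats, point_feats):
    """Return one attention distribution over points for each text token.

    token_feats: list of T token vectors; point_feats: list of P point vectors.
    Both are assumed (hypothetically) to live in the same d-dimensional space.
    """
    scale = 1.0 / math.sqrt(len(token_feats[0]))
    return [
        softmax([dot(token, point) * scale for point in point_feats])
        for token in token_feats
    ]


def attend(weights, point_feats):
    """Pool point features per token as an attention-weighted average."""
    dim = len(point_feats[0])
    return [
        [sum(w * point[i] for w, point in zip(row, point_feats)) for i in range(dim)]
        for row in weights
    ]


# Toy data: two tokens and three points in a 2-D feature space (illustrative only).
tokens = [[1.0, 0.0], [0.0, 1.0]]
points = [[4.0, 0.0], [0.0, 4.0], [1.0, 1.0]]

weights = token_to_point_attention(tokens, points)  # shape: T x P
pooled = attend(weights, points)                    # shape: T x d
```

Each row of `weights` is a probability distribution over points, so a token whose feature lines up with a particular point places most of its mass there. A learned module would produce these features from text and point-cloud encoders rather than from hand-written vectors.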

Abstract

Despite recent advances in 3D-text cross-modal alignment, existing state-of-the-art methods still struggle to align fine-grained textual semantics with detailed geometric structures, and their alignment performance degrades significantly when scaling to large 3D databases. To overcome these limitations, we introduce 3DAlign-DAER, a unified framework that aligns text and 3D geometry through the proposed dynamic attention policy and efficient retrieval strategy, capturing subtle correspondences for diverse cross-modal retrieval and classification tasks. Specifically, during training, the proposed dynamic attention policy (DAP) employs the Hierarchical Attention Fusion (HAF) module to represent the alignment as learnable fine-grained token-to-point attentions. To optimize these attentions across different tasks and geometric hierarchies, DAP further uses Monte Carlo tree search to dynamically calibrate the HAF attention weights via a hybrid reward signal, which strengthens the alignment between textual descriptions and local 3D geometry. During inference, 3DAlign-DAER introduces an Efficient Retrieval Strategy (ERS) that performs hierarchical search in large-scale embedding spaces, outperforming traditional methods (e.g., KNN) in both accuracy and efficiency. Furthermore, to facilitate text-3D alignment research and to train 3DAlign-DAER, we construct Align3D-2M, a large-scale dataset of 2M text-3D pairs that provides sufficient fine-grained cross-modal annotations. Extensive experiments demonstrate the superior performance of 3DAlign-DAER on diverse benchmarks. We will release our code, models, and datasets. Our code and updates are available at https://github.com/waltstephen/Cost-Effective-Communication.
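
The contrast between hierarchical search and exhaustive KNN can be sketched with a generic two-level index: first rank coarse bucket centroids, then rank only the items inside the best buckets. This is a common coarse-to-fine pattern and is offered as an assumption-laden illustration, not as the paper's ERS, which this summary does not specify. The helper names, the `n_probe` parameter, and the hand-assigned buckets (which in practice might come from a clustering step such as k-means) are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0


def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def build_index(embeddings, assignments):
    """Group item ids by coarse bucket and compute one centroid per bucket."""
    buckets = {}
    for item_id, bucket in enumerate(assignments):
        buckets.setdefault(bucket, []).append(item_id)
    centroids = {b: mean([embeddings[i] for i in ids]) for b, ids in buckets.items()}
    return buckets, centroids


def hierarchical_search(query, embeddings, buckets, centroids, n_probe=1, k=2):
    """Coarse stage picks the n_probe closest buckets; fine stage ranks their items."""
    probed = sorted(
        centroids, key=lambda b: cosine(query, centroids[b]), reverse=True
    )[:n_probe]
    candidates = [i for b in probed for i in buckets[b]]
    ranked = sorted(candidates, key=lambda i: cosine(query, embeddings[i]), reverse=True)
    return ranked[:k]


def brute_force_knn(query, embeddings, k=2):
    """Exhaustive baseline: score every item in the database."""
    ranked = sorted(
        range(len(embeddings)), key=lambda i: cosine(query, embeddings[i]), reverse=True
    )
    return ranked[:k]


# Toy database of four 2-D embeddings in two hand-assigned buckets (illustrative only).
embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
assignments = [0, 0, 1, 1]
buckets, centroids = build_index(embeddings, assignments)

query = [1.0, 0.0]
approx = hierarchical_search(query, embeddings, buckets, centroids)
exact = brute_force_knn(query, embeddings)
```

On this toy database both searches return the same two items, but the hierarchical version scores two centroids and two items instead of all four items. With many buckets, the number of comparisons per query grows with the probed buckets rather than with the whole database, at the cost of possibly missing neighbors that sit in an unprobed bucket.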
