Efficient3D: A Unified Framework for Adaptive and Debiased Token Reduction in 3D MLLMs

arXiv cs.CV / 4/6/2026


Key Points

  • The paper introduces Efficient3D, a unified framework for accelerating 3D Multimodal Large Language Models (3D MLLMs) via adaptive and debiased visual token pruning to reduce inference cost on constrained hardware.
  • Efficient3D adds a Debiased Visual Token Importance Estimator (DVTIE) that accounts for the influence of shallow attention layers to produce more reliable token-importance scores.
  • It further proposes Adaptive Token Rebalancing (ATR), which changes pruning strength dynamically based on scene complexity to preserve semantic completeness and maintain attention balance across layers.
  • Across five 3D vision-language benchmarks (ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, SQA3D), the method improves over unpruned baselines, including a +2.57% CIDEr gain on Scan2Cap.
  • The authors report releasing the associated code on GitHub, supporting reproducibility and practical experimentation with the framework for efficient 3D MLLM inference.
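To make the DVTIE idea concrete, here is a minimal sketch of attention-based token importance scoring with a shallow-layer correction. All names, the weighting scheme, and the cutoff are hypothetical illustrations of the general technique; the paper's actual estimator may differ in detail.

```python
# Illustrative sketch: score visual tokens by the attention mass they
# receive across layers, downweighting shallow layers whose attention
# patterns tend to be biased. Hypothetical, not the paper's exact method.
import numpy as np

def debiased_token_importance(attn_per_layer, shallow_cutoff=2, shallow_weight=0.3):
    """attn_per_layer: array of shape (L, T) -- per-layer attention mass
    each of T visual tokens receives (summed over heads and queries).
    Returns a (T,) importance distribution that sums to 1."""
    attn = np.asarray(attn_per_layer, dtype=float)
    num_layers = attn.shape[0]
    # Reduce the contribution of the first `shallow_cutoff` layers.
    weights = np.ones(num_layers)
    weights[:shallow_cutoff] = shallow_weight
    scores = (weights[:, None] * attn).sum(axis=0)
    return scores / scores.sum()

def prune_tokens(scores, keep_ratio):
    """Return sorted indices of the top-k tokens to keep."""
    k = max(1, int(round(keep_ratio * len(scores))))
    return np.sort(np.argsort(scores)[::-1][:k])
```

In this sketch, pruning at a fixed `keep_ratio` simply keeps the highest-scoring tokens; the debiasing step only changes which tokens rank highest when shallow layers disagree with deeper ones.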

Abstract

Recent advances in Multimodal Large Language Models (MLLMs) have expanded reasoning capabilities into 3D domains, enabling fine-grained spatial understanding. However, the substantial size of 3D MLLMs and the high dimensionality of input features introduce considerable inference overhead, which limits practical deployment on resource-constrained platforms. To overcome this limitation, this paper presents Efficient3D, a unified framework for visual token pruning that accelerates 3D MLLMs while maintaining competitive accuracy. The proposed framework introduces a Debiased Visual Token Importance Estimator (DVTIE) module, which accounts for the influence of shallow initial layers during attention aggregation, thereby producing more reliable importance predictions for visual tokens. In addition, an Adaptive Token Rebalancing (ATR) strategy is developed to dynamically adjust pruning strength based on scene complexity, preserving semantic completeness and maintaining balanced attention across layers. Together, these components enable context-aware token reduction that retains essential semantics at lower computational cost. Comprehensive experiments on five representative 3D vision-and-language benchmarks, including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D, demonstrate that Efficient3D achieves superior performance compared with unpruned baselines, with a +2.57% CIDEr improvement on the Scan2Cap dataset. Efficient3D thus provides a scalable and effective solution for efficient inference in 3D MLLMs. The code is released at: https://github.com/sol924/Efficient3D
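The ATR idea of tying pruning strength to scene complexity can be sketched with a simple heuristic: keep more tokens when the importance distribution is flat (many tokens matter, as in a cluttered scene) and fewer when it is peaked. The entropy-based mapping and all parameter names below are assumptions for illustration; the paper's actual rebalancing signal may differ.

```python
# Hypothetical adaptive keep-ratio schedule based on the entropy of the
# token-importance distribution. Flat distribution -> high entropy ->
# keep more tokens; peaked distribution -> keep fewer.
import numpy as np

def adaptive_keep_ratio(scores, min_keep=0.25, max_keep=0.75):
    """Map a normalized importance distribution to a keep ratio in
    [min_keep, max_keep] using its normalized entropy."""
    p = np.asarray(scores, dtype=float)
    p = p / p.sum()
    entropy = -(p * np.log(p + 1e-12)).sum()
    max_entropy = np.log(len(p))       # entropy of a uniform distribution
    frac = entropy / max_entropy       # 1.0 = flat scene, 0.0 = one-hot
    return min_keep + frac * (max_keep - min_keep)
```

Under this heuristic, a complex scene whose tokens are all comparably important is pruned gently, while a scene dominated by a few salient tokens is pruned aggressively, which matches the stated goal of preserving semantic completeness at lower cost.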