Efficient3D: A Unified Framework for Adaptive and Debiased Token Reduction in 3D MLLMs
arXiv cs.CV / 4/6/2026
Key Points
- The paper introduces Efficient3D, a unified framework for accelerating 3D Multimodal Large Language Models (3D MLLMs) via adaptive and debiased visual token pruning to reduce inference cost on constrained hardware.
- Efficient3D adds a Debiased Visual Token Importance Estimator (DVTIE) that accounts for the influence of shallow attention layers to produce more reliable token-importance scores.
- It further proposes Adaptive Token Rebalancing (ATR), which changes pruning strength dynamically based on scene complexity to preserve semantic completeness and maintain attention balance across layers.
- Across five 3D vision-language benchmarks (ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, SQA3D), the method improves over unpruned baselines, including a +2.57% CIDEr gain on Scan2Cap.
- The authors report releasing the code on GitHub, which supports reproducibility and hands-on experimentation with efficient 3D MLLM inference.
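
To make the two ideas in the key points concrete, here is a minimal, illustrative sketch of score-based visual token pruning with a scene-dependent keep ratio. It is not the paper's DVTIE or ATR: the function names, the `shallow_weight` blend of shallow and deep attention scores, the normalized-entropy proxy for scene complexity, and the `min_ratio`/`max_ratio` bounds are all assumptions chosen for illustration. Scores are assumed to be non-negative.

```python
import math


def combine_layer_scores(shallow, deep, shallow_weight=0.3):
    """Blend per-token scores from a shallow and a deep attention layer.

    Hypothetical stand-in for a debiased importance estimate: the weight
    and the linear blend are assumptions, not the paper's formulation.
    """
    if len(shallow) != len(deep):
        raise ValueError("score lists must have the same length")
    return [shallow_weight * s + (1.0 - shallow_weight) * d
            for s, d in zip(shallow, deep)]


def normalized_entropy(scores):
    """Entropy of the normalized scores, scaled to the range [0, 1]."""
    total = sum(scores)
    if total <= 0 or len(scores) < 2:
        return 1.0  # no usable signal: treat importance as evenly spread
    probs = [s / total for s in scores if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return max(0.0, min(1.0, entropy / math.log(len(scores))))


def adaptive_keep_ratio(scores, min_ratio=0.25, max_ratio=0.75):
    """Keep more tokens when importance is spread out (assumed proxy for a
    complex scene) and fewer when it is concentrated on a few tokens."""
    return min_ratio + (max_ratio - min_ratio) * normalized_entropy(scores)


def prune_tokens(scores, keep_ratio):
    """Return the indices of the highest-scoring tokens, in original order."""
    k = max(1, round(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    shallow = [0.40, 0.10, 0.30, 0.20]  # made-up attention scores
    deep = [0.05, 0.60, 0.25, 0.10]
    scores = combine_layer_scores(shallow, deep)
    ratio = adaptive_keep_ratio(scores)
    print("keep ratio:", ratio)
    print("kept token indices:", prune_tokens(scores, ratio))
```

In this sketch a scene whose importance mass sits on a single token is pruned down to the `min_ratio` floor, while a scene with evenly spread importance keeps up to `max_ratio` of its tokens. A real system would derive the scores from the model's attention maps and would likely use a different complexity measure.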




