Reinforcing 3D Understanding in Point-VLMs via Geometric Reward Credit Assignment

arXiv cs.CV / 4/24/2026

📰 NewsIdeas & Deep AnalysisModels & Research

共有:

Key Points

The paper argues that geometric hallucinations in Point-Vision-Language Models arise mainly from reinforcement-learning misalignment, where sparse 3D-related tokens are overwhelmed by noisy, sequence-level reward signals.
It introduces “Geometric Reward Credit Assignment,” which splits holistic supervision into field-specific signals and assigns them only to the responsible token spans to produce more precise gradient updates.
The method also adds a “Reprojection-Consistency” term that functions as a cross-modal physical constraint verifier to penalize physically impossible 3D geometries.
Experiments on a ShapeNetCore-derived calibrated benchmark show large gains, including improving 3D KPA from 0.64 to 0.93, raising 3D bounding-box IoU to 0.686, and increasing reprojection consistency to 0.852 while preserving strong 2D localization.
Overall, the work aims to move from merely plausible text outputs toward spatial predictions that are physically verifiable and reliable for embodied agents.

Abstract

Point-Vision-Language Models promise to empower embodied agents with executable spatial reasoning, yet they frequently succumb to geometric hallucination where predicted 3D structures contradict the observed 2D reality. We identify a key cause of this failure not as a representation bottleneck but as a structural misalignment in reinforcement learning, where sparse geometric tokens are drowned out by noisy and broadcasted sequence-level rewards. To resolve this causal dilution, we propose Geometric Reward Credit Assignment, a framework that disentangles holistic supervision into field-specific signals and routes them exclusively to their responsible token spans. This mechanism transforms vague feedback into precise gradient updates and effectively turns generic policy optimization into targeted structural alignment. Furthermore, we internalize physical constraints via a Reprojection-Consistency term which serves as a cross-modal verifier to penalize physically impossible geometries. Validated on a calibrated benchmark derived from ShapeNetCore, our approach bridges the reliability gap by boosting 3D KPA from 0.64 to 0.93, increasing 3D bounding box intersection over union to 0.686, and raising reprojection consistency scores to 0.852. Crucially, these gains are achieved while maintaining robust 2D localization performance, marking a meaningful step from plausible textual outputs toward physically verifiable spatial predictions.

The 67th Attempt: When Your "Knowledge Management" System Becomes a Self-Fulfilling Prophecy of Excellence

Dev.to

Context Engineering for Developers: A Practical Guide (2026)

Dev.to

GPT-5.5 is here. So is DeepSeek V4. And honestly, I am tired of version numbers.

Dev.to

I Built an AI Image Workflow with GPT Image 2.0 (+ Fixing Its Biggest Flaw)

Dev.to

Max-and-Omnis/Nemotron-3-Super-64B-A12B-Math-REAP-GGUF

Reddit r/LocalLLaMA

Reinforcing 3D Understanding in Point-VLMs via Geometric Reward Credit Assignment

Key Points

Abstract

Related Articles

The 67th Attempt: When Your "Knowledge Management" System Becomes a Self-Fulfilling Prophecy of Excellence

Context Engineering for Developers: A Practical Guide (2026)

GPT-5.5 is here. So is DeepSeek V4. And honestly, I am tired of version numbers.

I Built an AI Image Workflow with GPT Image 2.0 (+ Fixing Its Biggest Flaw)

Max-and-Omnis/Nemotron-3-Super-64B-A12B-Math-REAP-GGUF

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer