Mix3R: Mixing Feed-forward Reconstruction and Generative 3D Priors for Joint Multi-view Aligned 3D Reconstruction and Pose Estimation

arXiv cs.CV / 5/6/2026

Key Points

  • Mix3R is a new generative 3D reconstruction approach that unifies feed-forward pixel-aligned reconstruction with generative 3D priors to improve multi-view alignment and pose estimation.
  • The method builds 3D outputs in two stages—sparse voxel generation followed by textured geometry generation—while jointly producing coarse 3D structure, per-view point maps, and camera parameters that are aligned to that structure.
  • Mix3R uses a Mixture-of-Transformers architecture that injects global self-attention into both a pretrained feed-forward reconstruction model and a pretrained 3D generative model, retaining their priors while improving 2D-3D alignment (see the sketch after this list).
  • It introduces an overlap-based attention bias derived from the initial aligned sparse voxels and point maps, which is applied to a textured geometry generator for training-free texture placement.
  • Compared with prior pure generative and feed-forward methods, Mix3R reports better input alignment in the reconstructed 3D shapes and more accurate camera pose estimates.
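
To make the Mixture-of-Transformers mixing concrete, below is a minimal PyTorch sketch of a shared global self-attention layer that operates over the concatenated tokens of two pretrained branches and routes the result back to each branch. All names, shapes, and hyperparameters (e.g. `GlobalMixingAttention`, `recon_tokens`, `gen_tokens`) are our own illustrative assumptions, not Mix3R's actual code.

```python
# Sketch of the Mixture-of-Transformers idea: tokens from a feed-forward
# reconstruction branch and a 3D generative branch are mixed by a shared
# global self-attention over their concatenation, then returned to their
# own (pretrained) per-branch blocks. Hypothetical illustration only.
import torch
import torch.nn as nn


class GlobalMixingAttention(nn.Module):
    """Joint self-attention over the union of two branches' token sets."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_recon = nn.LayerNorm(dim)
        self.norm_gen = nn.LayerNorm(dim)

    def forward(self, recon_tokens: torch.Tensor, gen_tokens: torch.Tensor):
        # recon_tokens: (B, N_r, D) image/point-map tokens from the
        #   feed-forward reconstruction model.
        # gen_tokens:   (B, N_g, D) sparse-voxel tokens from the 3D generator.
        x = torch.cat([self.norm_recon(recon_tokens),
                       self.norm_gen(gen_tokens)], dim=1)
        mixed, _ = self.attn(x, x, x)   # global: every token attends to both branches
        n_r = recon_tokens.shape[1]
        # Residuals keep each branch close to its pretrained behavior.
        return recon_tokens + mixed[:, :n_r], gen_tokens + mixed[:, n_r:]


# Usage with random stand-in token streams of matching width.
B, D = 2, 256
layer = GlobalMixingAttention(dim=D)
r, g = layer(torch.randn(B, 196, D), torch.randn(B, 512, D))
print(r.shape, g.shape)  # torch.Size([2, 196, 256]) torch.Size([2, 512, 256])
```

The residual connections are the point of this design: each branch stays anchored to its pretrained prior, while the shared attention lets reconstruction tokens see the generated 3D structure and vice versa.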

Abstract

Recent work on sparse-view 3D reconstruction has taken two different paths: feed-forward reconstruction, which predicts pixel-aligned point maps but no complete geometry, and generative 3D reconstruction, which produces complete geometry but often with poor input alignment. We present Mix3R, a novel generative 3D reconstruction method that mixes feed-forward reconstruction and 3D generation into a single framework in an aligned manner. Mix3R generates a 3D shape in two stages: a sparse voxel generation stage and a textured geometry generation stage. Unlike pure generative methods, our first-stage generation jointly produces a coarse 3D structure (sparse voxels), per-view point maps, and camera parameters aligned to that 3D structure. This is made possible by a Mixture-of-Transformers architecture that inserts global self-attention into a feed-forward reconstruction model and a 3D generative model, both pretrained on large-scale data. This design retains the pretrained priors while enabling better 2D-3D alignment. From the initial aligned generations of sparse 3D voxels and point maps, we compute an overlap-based attention bias that is added directly to another pretrained textured geometry generation model, enabling it to correctly place input textures onto generated shapes in a training-free manner. Our design brings mutual benefits to both feed-forward reconstruction and 3D generation: the feed-forward branch learns to ground its predictions in a generative 3D prior, and conversely, the 3D generation branch is conditioned on geometrically informative features from the feed-forward branch. As a result, our method produces 3D shapes with better input alignment than pure 3D generative methods, together with camera pose estimates more accurate than those of previous feed-forward reconstruction methods. Our project page is at https://jsnln.github.io/mix3r/
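
The overlap-based attention bias can be pictured as an additive term on the texture generator's cross-attention logits: voxel queries are boosted toward image tokens whose aligned point-map points land near them, which is what makes the texture placement training-free. The sketch below is a minimal illustration under our own assumptions; the overlap test, the threshold `tau`, the bias `strength`, and all function and tensor names are hypothetical, not the paper's implementation.

```python
# Sketch of a training-free, overlap-based attention bias. Idea: using the
# aligned first-stage outputs (sparse voxels + per-view point maps), decide
# which image tokens "overlap" each voxel, and add that as a bias to the
# texture generator's cross-attention logits. Hypothetical illustration only.
import torch


def overlap_attention_bias(voxel_centers, point_map, tau=0.05, strength=4.0):
    """voxel_centers: (Nv, 3) centers of generated sparse voxels.
    point_map:     (Nt, 3) one aligned 3D point per input-view image token,
                   taken from the predicted point maps.
    Returns an (Nv, Nt) additive bias for attention logits."""
    # Pairwise distances between voxel centers and token point positions.
    dist = torch.cdist(voxel_centers, point_map)  # (Nv, Nt)
    # Tokens whose points fall near a voxel get boosted; all others keep a
    # zero bias, so the pretrained generator's behavior is preserved.
    return torch.where(dist < tau,
                       torch.full_like(dist, strength),
                       torch.zeros_like(dist))


def biased_cross_attention(q, k, v, bias):
    """Scaled dot-product cross-attention with the bias added pre-softmax.
    q: (Nv, D) voxel queries; k, v: (Nt, D) image-token keys/values."""
    logits = q @ k.T / q.shape[-1] ** 0.5 + bias
    return torch.softmax(logits, dim=-1) @ v


# Usage with random stand-in data.
Nv, Nt, D = 128, 196, 64
bias = overlap_attention_bias(torch.rand(Nv, 3), torch.rand(Nt, 3))
out = biased_cross_attention(torch.randn(Nv, D), torch.randn(Nt, D),
                             torch.randn(Nt, D), bias)
print(out.shape)  # torch.Size([128, 64])
```

Because the bias is added to the logits rather than baked into weights, it needs no fine-tuning of the textured geometry generator, which matches the paper's claim of training-free texture placement.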