GRAFT: Geometric Refinement and Fitting Transformer for Human Scene Reconstruction

arXiv cs.CV / 4/22/2026


Key Points

  • The paper introduces GRAFT, a transformer-based approach to reconstruct physically plausible 3D human–scene interactions from a single image while addressing the speed–reasoning trade-off of prior methods.
  • Instead of slow optimization, GRAFT amortizes geometry-based human–scene fitting into fast feed-forward inference by predicting “interaction gradients” that iteratively correct human meshes.
  • It represents interaction state with compact body-anchored tokens derived from scene geometry using “geometric probes,” and repeatedly updates meshes while re-probing the scene.
  • GRAFT can run end-to-end from image features or be used as a plug-and-play HSI prior from geometry alone, improving other feed-forward reconstructions without retraining.
  • Experiments report up to 113% better interaction quality than state-of-the-art feed-forward baselines, matching optimization-based interaction quality at roughly 50× lower runtime, with strong generalization to multi-person scenes.

Abstract

Reconstructing physically plausible 3D human–scene interactions (HSI) from a single image currently presents a trade-off: optimization-based methods offer accurate contact but are slow (~20s), while feed-forward approaches are fast yet lack explicit interaction reasoning, producing floating and interpenetration artifacts. Our key insight is that geometry-based human–scene fitting can be amortized into fast feed-forward inference. We present GRAFT (Geometric Refinement And Fitting Transformer), a learned HSI prior that predicts Interaction Gradients: corrective parameter updates that iteratively refine human meshes by reasoning about their 3D relationship to the surrounding scene. GRAFT encodes the interaction state into compact body-anchored tokens, each grounded in the scene geometry via Geometric Probes that capture spatial relationships with nearby surfaces. A lightweight transformer recurrently updates human meshes and re-probes the scene, ensuring the final pose aligns with both learned priors and observed geometry. GRAFT operates either as an end-to-end reconstructor using image features, or with geometry alone as a transferable plug-and-play HSI prior that improves feed-forward methods without retraining. Experiments show GRAFT improves interaction quality by up to 113% over state-of-the-art feed-forward methods and matches optimization-based interaction quality at ~50× lower runtime, while generalizing seamlessly to in-the-wild multi-person scenes and being preferred in 64.8% of cases in a three-way user study. Project page: https://pradyumnaym.github.io/graft.
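To make the refine-and-re-probe loop concrete, here is a minimal sketch of the control flow the abstract describes. Everything below is an illustrative assumption, not the paper's implementation: `geometric_probe` stands in for the Geometric Probes (here reduced to nearest-surface offset vectors), and `predict_interaction_gradient` stands in for the learned transformer that predicts Interaction Gradients (here reduced to a fixed step toward the nearest surface, acting directly on vertices rather than on body-model parameters).

```python
import numpy as np

def geometric_probe(vertices, scene_points):
    """Stand-in for Geometric Probes: for each body-anchored point,
    return the offset vector to its nearest scene surface point."""
    diffs = scene_points[None, :, :] - vertices[:, None, :]   # (V, S, 3)
    dists = np.linalg.norm(diffs, axis=-1)                    # (V, S)
    nearest = dists.argmin(axis=1)                            # (V,)
    return diffs[np.arange(len(vertices)), nearest]           # (V, 3)

def predict_interaction_gradient(probe_features, step=0.5):
    """Stand-in for the transformer's Interaction Gradients: a learned
    model would map probe tokens to body-parameter updates; here we
    simply step toward the nearest surface."""
    return step * probe_features

def refine(vertices, scene_points, n_iters=10):
    """Recurrent loop sketched in the abstract: update the mesh,
    then re-probe the (updated) geometry before the next step."""
    for _ in range(n_iters):
        probes = geometric_probe(vertices, scene_points)
        vertices = vertices + predict_interaction_gradient(probes)
    return vertices
```

The key structural point this sketch preserves is that probing happens inside the loop: after each corrective update, the spatial relationship to the scene is recomputed, so later updates condition on the refined pose rather than on the initial estimate.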