GRAFT: Geometric Refinement and Fitting Transformer for Human Scene Reconstruction

arXiv cs.CV / 4/22/2026


Key Points

  • The paper introduces GRAFT, a transformer-based approach to reconstruct physically plausible 3D human–scene interactions from a single image while addressing the speed–reasoning trade-off of prior methods.
  • Instead of slow optimization, GRAFT amortizes geometry-based human–scene fitting into fast feed-forward inference by predicting “interaction gradients” that iteratively correct human meshes.
  • It represents interaction state with compact body-anchored tokens derived from scene geometry using “geometric probes,” and repeatedly updates meshes while re-probing the scene.
  • GRAFT can run end-to-end from image features or be used as a plug-and-play HSI prior from geometry alone, improving other feed-forward reconstructions without retraining.
  • Experiments report up to 113% better interaction quality than state-of-the-art feed-forward baselines, matching optimization-based interaction quality at roughly 50× lower runtime, with strong generalization to multi-person scenes.

Abstract

Reconstructing physically plausible 3D human–scene interactions (HSI) from a single image currently presents a trade-off: optimization-based methods offer accurate contact but are slow (~20s), while feed-forward approaches are fast yet lack explicit interaction reasoning, producing floating and interpenetration artifacts. Our key insight is that geometry-based human–scene fitting can be amortized into fast feed-forward inference. We present GRAFT (Geometric Refinement And Fitting Transformer), a learned HSI prior that predicts Interaction Gradients: corrective parameter updates that iteratively refine human meshes by reasoning about their 3D relationship to the surrounding scene. GRAFT encodes the interaction state into compact body-anchored tokens, each grounded in the scene geometry via Geometric Probes that capture spatial relationships with nearby surfaces. A lightweight transformer recurrently updates human meshes and re-probes the scene, ensuring the final pose aligns with both learned priors and observed geometry. GRAFT operates either as an end-to-end reconstructor using image features, or with geometry alone as a transferable plug-and-play HSI prior that improves feed-forward methods without retraining. Experiments show GRAFT improves interaction quality by up to 113% over state-of-the-art feed-forward methods and matches optimization-based interaction quality at ~50× lower runtime, while generalizing seamlessly to in-the-wild multi-person scenes and being preferred in 64.8% of cases in a three-way user study. Project page: https://pradyumnaym.github.io/graft.
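To make the refine-and-re-probe loop concrete, here is a minimal sketch of the control flow the abstract describes. Everything below is an illustrative assumption, not the paper's implementation: `geometric_probe` stands in for the Geometric Probes (here reduced to nearest-surface offset vectors), and `predict_interaction_gradient` stands in for the learned transformer that predicts Interaction Gradients (here reduced to a fixed step toward the nearest surface, acting directly on vertices rather than on body-model parameters).

```python
import numpy as np

def geometric_probe(vertices, scene_points):
    """Stand-in for Geometric Probes: for each body-anchored point,
    return the offset vector to its nearest scene surface point."""
    diffs = scene_points[None, :, :] - vertices[:, None, :]   # (V, S, 3)
    dists = np.linalg.norm(diffs, axis=-1)                    # (V, S)
    nearest = dists.argmin(axis=1)                            # (V,)
    return diffs[np.arange(len(vertices)), nearest]           # (V, 3)

def predict_interaction_gradient(probe_features, step=0.5):
    """Stand-in for the transformer's Interaction Gradients: a learned
    model would map probe tokens to body-parameter updates; here we
    simply step toward the nearest surface."""
    return step * probe_features

def refine(vertices, scene_points, n_iters=10):
    """Recurrent loop sketched in the abstract: update the mesh,
    then re-probe the (updated) geometry before the next step."""
    for _ in range(n_iters):
        probes = geometric_probe(vertices, scene_points)
        vertices = vertices + predict_interaction_gradient(probes)
    return vertices
```

The key structural point this sketch preserves is that probing happens inside the loop: after each corrective update, the spatial relationship to the scene is recomputed, so later updates condition on the refined pose rather than on the initial estimate.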