Geometrically Consistent Multi-View Scene Generation from Freehand Sketches

arXiv cs.CV / 4/17/2026


Key Points

  • The paper presents a new method to generate geometrically consistent multi-view scenes from a single freehand sketch, despite sketches being highly ambiguous and spatially distorted inputs.
  • It introduces three main contributions: a curated ~9k sketch-to-multiview dataset, Parallel Camera-Aware Attention Adapters (CA3) for geometric inductive bias in a video transformer, and a Sparse Correspondence Supervision Loss (CSL) using Structure-from-Motion (a rough sketch of the adapter idea follows this list).
  • The proposed framework produces all views in a single denoising process without reference images, iterative refinement, or per-scene optimization, aiming to reduce both complexity and cost.
  • Experiments report substantial gains over two-stage baselines, including over 60% improvement in realism measured by FID, 23% better geometric consistency (Corr-Acc), and up to 3.7× faster inference.
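
No implementation details beyond the abstract are available here, so the following is a minimal, hypothetical sketch of what a parallel camera-aware attention adapter could look like: a small cross-attention branch, conditioned on a per-token camera encoding, that runs alongside a transformer block and adds a gated residual. The class name, the Plücker-ray encoding, and all shapes are assumptions for illustration, not the authors' design.

```python
# Hypothetical CA3-style adapter sketch (PyTorch). Not the paper's code:
# the Plücker-ray camera encoding, shapes, and gating are assumptions.
import torch
import torch.nn as nn

class CameraAwareAdapter(nn.Module):
    """Parallel branch beside a transformer block: cross-attends from the
    block's tokens to per-token camera codes and adds a gated residual."""

    def __init__(self, dim: int, cam_dim: int = 6, heads: int = 8):
        super().__init__()
        self.cam_proj = nn.Linear(cam_dim, dim)   # lift camera codes to token width
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # zero-init: adapter starts as identity

    def forward(self, tokens: torch.Tensor, cam_rays: torch.Tensor) -> torch.Tensor:
        # tokens:   (B, N, dim) hidden states of one video-transformer block
        # cam_rays: (B, N, cam_dim), e.g. Plücker coordinates of each token's ray
        cam = self.cam_proj(cam_rays)
        residual, _ = self.attn(query=tokens, key=cam, value=cam)
        return tokens + self.gate * residual      # parallel, camera-aware correction
```

Zero-initializing the gate makes the adapter a no-op at the start of fine-tuning, a common way to bolt new conditioning onto a pretrained backbone without disrupting it; whether the paper uses this particular trick is not stated in the summary.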

Abstract

We tackle a new problem: generating geometrically consistent multi-view scenes from a single freehand sketch. Freehand sketches are the most geometrically impoverished input one could offer a multi-view generator. They convey scene intent through abstract strokes while introducing spatial distortions that actively conflict with any consistent 3D interpretation. No prior method attempts this; existing multi-view approaches require photographs or text, while sketch-to-3D methods need multiple views or costly per-scene optimization. We address three compounding challenges (absent training data, the need for geometric reasoning from distorted 2D input, and cross-view consistency) through three mutually reinforcing contributions: (i) a curated dataset of ~9k sketch-to-multiview samples, constructed via an automated generation and filtering pipeline; (ii) Parallel Camera-Aware Attention Adapters (CA3) that inject geometric inductive biases into the video transformer; and (iii) a Sparse Correspondence Supervision Loss (CSL) derived from Structure-from-Motion reconstructions. Our framework synthesizes all views in a single denoising process without requiring reference images, iterative refinement, or per-scene optimization. Our approach significantly outperforms state-of-the-art two-stage baselines, improving realism (FID) by over 60% and geometric consistency (Corr-Acc) by 23%, while providing up to a 3.7× inference speedup.
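
The abstract does not spell out the CSL formulation, so the sketch below shows one plausible reading: sample the generated views at pixel pairs matched by a Structure-from-Motion reconstruction and penalize their disagreement. The tensor layout, bilinear sampling, and L2 penalty are illustrative assumptions rather than the paper's definition.

```python
# Hypothetical CSL-style loss sketch (PyTorch). Not the paper's code: the
# match format and the L2 penalty at SfM keypoints are assumptions.
import torch
import torch.nn.functional as F

def correspondence_loss(views: torch.Tensor, matches: torch.Tensor) -> torch.Tensor:
    """views:   (V, C, H, W) generated view (or feature) maps
    matches: (M, 2, 3) SfM correspondences; each side is (view_index, x, y)
             with x, y normalized to [-1, 1]."""

    def sample(side: torch.Tensor) -> torch.Tensor:
        idx = side[:, 0].long()                  # (M,) which view each point lives in
        grid = side[:, 1:].view(-1, 1, 1, 2)     # (M, 1, 1, 2) one sample point per match
        out = F.grid_sample(views[idx], grid, align_corners=True)
        return out.squeeze(-1).squeeze(-1)       # (M, C) bilinearly sampled values

    # Corresponding points depict the same 3D surface, so their sampled
    # values should agree across views.
    return F.mse_loss(sample(matches[:, 0]), sample(matches[:, 1]))
```

Because SfM correspondences are sparse, a loss of this form only constrains geometry at reliably matched points; the standard denoising objective would still govern the rest of the image.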