Camera Control for Text-to-Image Generation via Learning Viewpoint Tokens

arXiv cs.CV / 4/23/2026

Key Points

  • The paper introduces a method to improve precise camera control in text-to-image generation using learned “viewpoint tokens” that parameterize camera viewpoints.
  • It fine-tunes existing image generation models on a curated dataset combining 3D-rendered images (for geometric supervision) with photorealistic augmentations (to diversify appearance and backgrounds).
  • Experiments show the approach reaches state-of-the-art accuracy in viewpoint-conditioned generation while maintaining image quality and prompt fidelity.
  • The authors argue that their viewpoint tokens learn factorized geometric representations, helping generalization to unseen object categories rather than overfitting to object-specific appearance cues.
  • The work suggests that text-vision latent spaces can be endowed with explicit 3D camera structure, offering a path toward more controllable, geometry-aware prompts.

Abstract

Current text-to-image models struggle to provide precise camera control using natural language alone. In this work, we present a framework for precise camera control with global scene understanding in text-to-image generation by learning parametric camera tokens. We fine-tune image generation models for viewpoint-conditioned text-to-image generation on a curated dataset that combines 3D-rendered images for geometric supervision and photorealistic augmentations for appearance and background diversity. Qualitative and quantitative experiments demonstrate that our method achieves state-of-the-art accuracy while preserving image quality and prompt fidelity. Unlike prior methods that overfit to object-specific appearance correlations, our viewpoint tokens learn factorized geometric representations that transfer to unseen object categories. Our work shows that text-vision latent spaces can be endowed with explicit 3D camera structure, offering a pathway toward geometrically-aware prompts for text-to-image generation. Project page: https://randdl.github.io/viewtoken_control/
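
To make the idea of a parametric camera token concrete, here is a minimal sketch of how a camera pose might be turned into one extra conditioning token. The abstract does not specify the parameterization, so everything here is an assumption for illustration: the pose is taken to be azimuth, elevation, and distance; the sinusoidal features and the linear projection are one plausible choice; and the names `camera_features`, `ViewpointToken`, and `condition_prompt` are hypothetical. The weights are random stand-ins for what fine-tuning would learn.

```python
import math
import random


def camera_features(azimuth_deg, elevation_deg, distance, num_freqs=4):
    """Encode a camera pose as sin/cos features (assumed parameterization).

    Integer frequencies make the encoding periodic, so azimuth 0 and 360
    map to the same features. Distance must be positive.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    feats = []
    for k in range(num_freqs):
        f = 2 ** k
        feats += [math.sin(f * az), math.cos(f * az),
                  math.sin(f * el), math.cos(f * el)]
    feats.append(math.log(distance))  # log scale for camera distance
    return feats


class ViewpointToken:
    """Hypothetical projection from camera features to one token embedding."""

    def __init__(self, embed_dim, num_freqs=4, seed=0):
        rng = random.Random(seed)
        self.num_freqs = num_freqs
        in_dim = 4 * num_freqs + 1
        # Random weights stand in for parameters learned during fine-tuning.
        self.weight = [[rng.gauss(0.0, 0.02) for _ in range(in_dim)]
                       for _ in range(embed_dim)]
        self.bias = [0.0] * embed_dim

    def __call__(self, azimuth_deg, elevation_deg, distance):
        x = camera_features(azimuth_deg, elevation_deg, distance,
                            self.num_freqs)
        return [sum(w * xi for w, xi in zip(row, x)) + b
                for row, b in zip(self.weight, self.bias)]


def condition_prompt(text_embeddings, viewpoint_token):
    """Prepend the viewpoint token to the prompt's token embeddings."""
    return [viewpoint_token] + list(text_embeddings)


# Usage: a three-token prompt with 8-dimensional placeholder embeddings.
tokenizer = ViewpointToken(embed_dim=8)
prompt = [[0.0] * 8 for _ in range(3)]
conditioned = condition_prompt(prompt, tokenizer(45.0, 20.0, 2.5))
```

Because the token depends only on the camera pose and not on the prompt's object, a design along these lines is at least compatible with the factorized, category-independent behavior the authors describe, though the paper's actual architecture may differ.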