Camera Control for Text-to-Image Generation via Learning Viewpoint Tokens

arXiv cs.CV / 4/23/2026

Key Points

The paper introduces a method to improve precise camera control in text-to-image generation using learned “viewpoint tokens” that parameterize camera viewpoints.
It fine-tunes existing image generation models on a curated dataset combining 3D-rendered images (for geometric supervision) with photorealistic augmentations (to diversify appearance and backgrounds).
Experiments show the approach reaches state-of-the-art accuracy in viewpoint-conditioned generation while maintaining image quality and prompt fidelity.
The authors argue that their viewpoint tokens learn factorized geometric representations, helping generalization to unseen object categories rather than overfitting to object-specific appearance cues.
The work suggests that text-vision latent spaces can be given explicit 3D camera structure, which points toward more controllable, geometry-aware prompting.
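To make the idea of a parametric viewpoint token concrete, here is a minimal sketch of one way camera parameters could be turned into a vector that sits next to the text token embeddings. Everything in it is an assumption for illustration and is not taken from the paper: the azimuth/elevation/distance parameterization, the sinusoidal features, the linear projection, the embedding width, and the function names `viewpoint_features`, `project`, and `append_viewpoint_token`. The random weights stand in for parameters that would be learned during fine-tuning.

```python
import math
import random


def viewpoint_features(azimuth_deg, elevation_deg, distance, num_freqs=4):
    """Encode a camera viewpoint as sinusoidal features (hypothetical scheme).

    Using sin/cos pairs keeps the encoding continuous across the
    0/360 degree wrap-around of the azimuth angle.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    feats = []
    for k in range(num_freqs):
        freq = 2 ** k
        for angle in (az, el):
            feats.append(math.sin(freq * angle))
            feats.append(math.cos(freq * angle))
    # Log-distance so that doubling the distance is a constant offset.
    feats.append(math.log(distance))
    return feats


def project(feats, weight, bias):
    """Linear projection; `weight` holds one row per output dimension."""
    return [sum(w * f for w, f in zip(row, feats)) + b
            for row, b in zip(weight, bias)]


def append_viewpoint_token(text_tokens, viewpoint_token):
    """Return a new sequence with the viewpoint token appended at the end."""
    return text_tokens + [viewpoint_token]


# Illustrative sizes only; a real text encoder is far wider.
EMBED_DIM = 8
rng = random.Random(0)
feats = viewpoint_features(azimuth_deg=45.0, elevation_deg=20.0, distance=2.5)
weight = [[rng.gauss(0.0, 0.02) for _ in feats] for _ in range(EMBED_DIM)]
bias = [0.0] * EMBED_DIM

text_tokens = [[0.0] * EMBED_DIM for _ in range(5)]  # placeholder prompt embeddings
sequence = append_viewpoint_token(text_tokens, project(feats, weight, bias))
```

Because the token depends only on camera parameters and not on the object named in the prompt, a design like this is one plausible route to the factorized, category-independent behavior the key points describe.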

Abstract

Current text-to-image models struggle to provide precise camera control using natural language alone. In this work, we present a framework for precise camera control with global scene understanding in text-to-image generation by learning parametric camera tokens. We fine-tune image generation models for viewpoint-conditioned text-to-image generation on a curated dataset that combines 3D-rendered images for geometric supervision and photorealistic augmentations for appearance and background diversity. Qualitative and quantitative experiments demonstrate that our method achieves state-of-the-art accuracy while preserving image quality and prompt fidelity. Unlike prior methods that overfit to object-specific appearance correlations, our viewpoint tokens learn factorized geometric representations that transfer to unseen object categories. Our work shows that text-vision latent spaces can be endowed with explicit 3D camera structure, offering a pathway toward geometrically-aware prompts for text-to-image generation. Project page: https://randdl.github.io/viewtoken_control/
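The geometric supervision mentioned in the abstract relies on rendered images whose camera pose is known exactly. As a rough sketch of how such poses might be enumerated, the snippet below converts azimuth, elevation, and distance into a camera position on a sphere around an object at the origin. The y-up convention, the choice of azimuth 0 on the +z axis, the sampling grid, and the names `camera_position` and `viewpoint_grid` are assumptions made here for illustration; the paper's actual rendering setup is not described in this summary.

```python
import math


def camera_position(azimuth_deg, elevation_deg, distance):
    """Spherical-to-Cartesian camera position, object at the origin.

    Assumed convention: y is up, azimuth 0 places the camera on the +z axis,
    and positive elevation raises the camera above the horizontal plane.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    x = distance * math.cos(el) * math.sin(az)
    y = distance * math.sin(el)
    z = distance * math.cos(el) * math.cos(az)
    return (x, y, z)


def viewpoint_grid(azimuths, elevations, distance):
    """Enumerate labeled viewpoints that a renderer could be driven with."""
    return [
        {
            "azimuth": a,
            "elevation": e,
            "distance": distance,
            "position": camera_position(a, e, distance),
        }
        for e in elevations
        for a in azimuths
    ]


# Hypothetical sampling: 8 azimuths at 3 elevations, fixed distance.
grid = viewpoint_grid(
    azimuths=[i * 45.0 for i in range(8)],
    elevations=[0.0, 20.0, 40.0],
    distance=2.0,
)
```

Each entry pairs the camera parameters with the position a renderer would need, so every rendered image comes with an exact viewpoint label.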