Linear Image Generation by Synthesizing Exposure Brackets

arXiv cs.CV / 4/24/2026


Key Points

  • The paper proposes text-to-linear-image generation: producing scene-referred linear images that preserve the full dynamic range, which gives professional editors more latitude than typical display-referred outputs.
  • It argues that existing generative models mostly synthesize display-referred images, which limits downstream edits because the dynamic range has already been compressed and subjective stylization has already been applied.
  • To overcome difficulties with pretrained VAEs in latent diffusion (especially preserving both extreme highlights and shadows), the method represents a linear image as multiple exposure brackets covering different parts of the dynamic range.
  • The approach uses a DiT-based flow-matching architecture to generate exposure brackets conditioned on text, and it demonstrates downstream uses like text-guided linear editing and ControlNet-style structure conditioning.
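
The exposure-bracket idea in the third point can be illustrated with a small sketch. It is not the paper's implementation: the summary does not give the number of brackets, their EV spacing, or how they are merged, so the three EV offsets, the `lo`/`hi` thresholds, and the function names `make_brackets` and `merge_brackets` are all assumptions made for illustration. Each bracket scales the linear values by 2^EV and clips to [0, 1], so any single bracket fits a conventional image range while the set as a whole covers the full dynamic range.

```python
from typing import List, Sequence


def make_brackets(linear: Sequence[float], evs: Sequence[float]) -> List[List[float]]:
    """Render one clipped [0, 1] bracket per EV offset (hypothetical helper)."""
    return [[min(max(v * 2.0 ** ev, 0.0), 1.0) for v in linear] for ev in evs]


def merge_brackets(
    brackets: Sequence[Sequence[float]],
    evs: Sequence[float],
    lo: float = 0.02,  # assumed threshold below which a value counts as underexposed
    hi: float = 0.98,  # assumed threshold above which a value counts as saturated
) -> List[float]:
    """Recover linear values by undoing the exposure scale of well-exposed samples."""
    merged = []
    for i in range(len(brackets[0])):
        estimates = [
            b[i] / 2.0 ** ev for b, ev in zip(brackets, evs) if lo <= b[i] <= hi
        ]
        if estimates:
            merged.append(sum(estimates) / len(estimates))
        else:
            # No bracket is well exposed here: fall back to the sample closest
            # to mid-range, preferring the lowest EV on ties (saturated pixels).
            k = min(
                range(len(brackets)),
                key=lambda j: (abs(brackets[j][i] - 0.5), evs[j]),
            )
            merged.append(brackets[k][i] / 2.0 ** evs[k])
    return merged


# Toy "image": deep shadow, mid-gray, diffuse white, and a bright highlight.
linear = [0.001, 0.18, 1.0, 12.0]
evs = [-4.0, 0.0, 4.0]
brackets = make_brackets(linear, evs)
recovered = merge_brackets(brackets, evs)
```

In this toy case the EV 0 bracket alone clips the highlight at 1.0, while the EV -4 bracket still holds it at 0.75, which is why the merged result can recover values that no single [0, 1] image retains.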

Abstract

The life of a photo begins with photons striking the sensor, whose signals are passed through a sophisticated image signal processing (ISP) pipeline to produce a display-referred image. However, such images are no longer faithful to the incident light, being compressed in dynamic range and stylized by subjective preferences. In contrast, RAW images record direct sensor signals before non-linear tone mapping. After camera response curve correction and demosaicing, they can be converted into linear images, which are scene-referred representations that directly reflect true irradiance and are invariant to sensor-specific factors. Since image sensors have better dynamic range and bit depth, linear images contain richer information than display-referred ones, leaving users more room for editing during post-processing. Despite this advantage, current generative models mainly synthesize display-referred images, which inherently limits downstream editing. In this paper, we address the task of text-to-linear-image generation: synthesizing a high-quality, scene-referred linear image that preserves full dynamic range, conditioned on a text prompt, for professional post-processing. Generating linear images is challenging, as pre-trained VAEs in latent diffusion models struggle to simultaneously preserve extreme highlights and shadows due to the higher dynamic range and bit depth. To this end, we represent a linear image as a sequence of exposure brackets, each capturing a specific portion of the dynamic range, and propose a DiT-based flow-matching architecture for text-conditioned exposure bracket generation. We further demonstrate downstream applications including text-guided linear image editing and structure-conditioned generation via ControlNet.
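
For readers unfamiliar with flow matching, the sketch below shows how a training target is commonly built. The abstract does not say which flow-matching variant the paper uses, so the straight-line path from noise to data here is an assumption, and the names `flow_matching_pair`, `flow_matching_loss`, and `model` are hypothetical. Text conditioning and the DiT network are omitted; `x1` stands in for a flattened latent of the exposure brackets.

```python
import random
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def flow_matching_pair(
    x1: Sequence[float], t: float, rng: random.Random
) -> Tuple[Vector, Vector]:
    """Return (x_t, target velocity) on the straight path from noise x0 to data x1."""
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]                  # Gaussian noise sample
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]   # point on the path at time t
    v_target = [b - a for a, b in zip(x0, x1)]              # constant velocity x1 - x0
    return x_t, v_target


def flow_matching_loss(
    model: Callable[[Vector, float], Vector],
    x1: Sequence[float],
    rng: random.Random,
) -> float:
    """Mean squared error between the predicted and target velocity at a random time."""
    t = rng.random()
    x_t, v_target = flow_matching_pair(x1, t, rng)
    v_pred = model(x_t, t)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_target)) / len(v_target)


# Usage with a placeholder "model" that always predicts zero velocity.
rng = random.Random(0)
x1 = [0.5, -1.0, 2.0]
x_t, v_target = flow_matching_pair(x1, 0.3, rng)
loss = flow_matching_loss(lambda x, t: [0.0] * len(x), x1, rng)
```

Under this assumed path, following the target velocity from `x_t` for the remaining time `1 - t` lands exactly on `x1`, which is the property a sampler exploits when it integrates the learned velocity field from noise to data.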
