RawGen: Learning Camera Raw Image Generation

arXiv cs.CV / 4/2/2026


Key Points

  • The paper introduces RawGen, a diffusion-based framework for generating camera raw (linear, scene-referred) images from text and for inverting sRGB back to camera-specific raw representations.
  • RawGen is motivated by the scarcity of large-scale raw data: existing raw datasets are small and often tied to specific camera hardware and fixed image signal processor (ISP) pipelines.
  • To produce physically meaningful linear outputs rather than photo-finished sRGB, the method uses specialized processing across latent and pixel spaces and trains on a many-to-one inverse-ISP dataset that anchors multiple ISP-varied sRGB renditions to a common scene target.
  • The authors fine-tune a conditional denoiser and a specialized decoder to better handle unknown and diverse ISP pipelines, improving camera-centric linear reconstructions compared with traditional inverse-ISP approaches.
  • They also report that RawGen can generate scalable, text-driven synthetic raw data that helps downstream low-level vision tasks beyond raw reconstruction itself.
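The forward and inverse ISP mapping the key points refer to can be illustrated with a toy pipeline. The sketch below is a minimal, hypothetical ISP (white balance, a 3x3 color matrix, and gamma); it is not the paper's model, and real ISPs add demosaicing, denoising, tone curves, and other photo-finishing stages:

```python
import numpy as np

def forward_isp(raw, wb_gains, ccm, gamma=2.2):
    """Render a linear, scene-referred raw image (H x W x 3, values in [0, 1])
    into a display-referred sRGB-like image. Illustrative only."""
    x = raw * wb_gains                 # per-channel white balance
    x = np.clip(x @ ccm.T, 0.0, 1.0)   # 3x3 color correction matrix
    return x ** (1.0 / gamma)          # simple gamma as photo-finishing

def inverse_isp(srgb, wb_gains, ccm, gamma=2.2):
    """Invert the toy pipeline above. This works only when the ISP parameters
    are known and no values were clipped; the hard case RawGen targets is
    when the ISP is unknown and varies per camera."""
    x = srgb ** gamma                  # undo gamma
    x = x @ np.linalg.inv(ccm).T       # undo color matrix
    return x / wb_gains                # undo white balance
```

A round trip (`inverse_isp(forward_isp(raw, ...), ...)`) recovers the raw image only for in-gamut, unclipped values, which is why fixed-ISP inversion breaks down on photos rendered by unknown pipelines.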

Abstract

Cameras capture scene-referred linear raw images, which are processed by onboard image signal processors (ISPs) into display-referred 8-bit sRGB outputs. Although raw data is more faithful for low-level vision tasks, collecting large-scale raw datasets remains a major bottleneck, as existing datasets are limited and tied to specific camera hardware. Generative models offer a promising way to address this scarcity -- however, existing diffusion frameworks are designed to synthesize photo-finished sRGB images rather than physically consistent linear representations. This paper presents RawGen, to our knowledge the first diffusion-based framework enabling text-to-raw generation for arbitrary target cameras, alongside sRGB-to-raw inversion. RawGen leverages the generative priors of large-scale sRGB diffusion models to synthesize physically meaningful linear outputs, such as CIE XYZ or camera-specific raw representations, via specialized processing in latent and pixel spaces. To handle unknown and diverse ISP pipelines and photo-finishing effects in diffusion-model training data, we build a many-to-one inverse-ISP dataset where multiple sRGB renditions of the same scene generated using diverse ISP parameters are anchored to a common scene-referred target. Fine-tuning a conditional denoiser and specialized decoder on this dataset allows RawGen to obtain camera-centric linear reconstructions that effectively invert the rendering pipeline. We demonstrate RawGen's superior performance over traditional inverse-ISP methods that assume a fixed ISP. Furthermore, we show that augmenting training pipelines with RawGen's scalable, text-driven synthetic data can benefit downstream low-level vision tasks.
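The many-to-one dataset construction described in the abstract, where multiple ISP-varied sRGB renditions of a scene are anchored to one shared linear target, can be sketched as follows. The parameter ranges and the simple render function are illustrative assumptions, not values from the paper:

```python
import numpy as np

def sample_isp_params(rng):
    """Sample a random, hypothetical ISP configuration (white-balance gains,
    color matrix, gamma). Ranges are illustrative assumptions."""
    wb = rng.uniform(0.7, 2.2, size=3)
    ccm = np.eye(3) + rng.normal(0.0, 0.08, size=(3, 3))
    gamma = rng.uniform(1.8, 2.6)
    return wb, ccm, gamma

def render_srgb(linear, wb, ccm, gamma):
    """Render one sRGB-like image from a linear scene with the given params."""
    x = np.clip((linear * wb) @ ccm.T, 0.0, 1.0)
    return x ** (1.0 / gamma)

def many_to_one_pairs(linear_scene, n_renditions, seed=0):
    """Build (sRGB rendition, shared linear target) training pairs: many
    photo-finished views of the same scene, all anchored to one linear image."""
    rng = np.random.default_rng(seed)
    return [(render_srgb(linear_scene, *sample_isp_params(rng)), linear_scene)
            for _ in range(n_renditions)]
```

Training a conditional denoiser on such pairs encourages it to map any of the varied renditions back to the single scene-referred anchor, rather than memorizing one fixed ISP.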