VeraRetouch: A Lightweight Fully Differentiable Framework for Multi-Task Reasoning Photo Retouching

arXiv cs.CV / 5/1/2026


Key Points

  • The paper proposes VeraRetouch, a lightweight, fully differentiable multi-task reasoning framework for photo retouching that can jointly analyze defects, produce reasoning plans, and apply precise edits.
  • It uses a compact 0.5B vision-language model to generate retouching plans from instructions and scene semantics, and replaces external non-differentiable tools with a fully differentiable Retouch Renderer for end-to-end pixel-level training.
  • The Retouch Renderer is trained with decoupled control latents for lighting, global color, and targeted color adjustments, reducing optimization barriers and parameter redundancy while improving generalization.
  • To address limited data, the authors introduce AetherRetouch-1M+, a million-scale dataset for professional retouching created via a new inverse degradation workflow, and they add DAPO-AE, a reinforcement learning post-training method for better autonomous aesthetic cognition.
  • Experiments reportedly show state-of-the-art results on multiple benchmarks with a much smaller model footprint, supporting mobile deployment, and the code/models are released on GitHub.
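The "decoupled control latents" idea can be illustrated with a minimal sketch: a renderer built entirely from smooth operations (scales, exponents, sigmoid masks), so gradients can flow from pixel losses back into separate latent groups for lighting, global color, and targeted color. The specific operations and latent dimensions below are illustrative assumptions, not the paper's actual Retouch Renderer.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def retouch_render(img, light_z, color_z, target_z):
    """Apply decoupled, smooth (hence differentiable) adjustments.

    img:      float array in [0, 1], shape (H, W, 3)
    light_z:  2 latents -> exposure gain and gamma (lighting)
    color_z:  3 latents -> per-channel gains (global color)
    target_z: 3 latents -> extra gain applied only to red-dominant pixels
              (a stand-in for "targeted color adjustments")
    All parameterizations here are illustrative assumptions.
    """
    # Lighting: exposure as a positive scale, gamma as a positive exponent
    # (exp keeps both strictly positive without hard constraints).
    exposure = np.exp(light_z[0])
    gamma = np.exp(light_z[1])
    out = np.clip(img * exposure, 1e-6, 1.0) ** gamma

    # Global color: per-channel multiplicative gains.
    gains = np.exp(color_z).reshape(1, 1, 3)
    out = out * gains

    # Targeted color: a soft sigmoid mask selects red-dominant pixels,
    # so even the spatial selection stays differentiable.
    mask = sigmoid(10.0 * (out[..., 0] - out[..., 1]))[..., None]
    out = out * (1.0 + mask * target_z.reshape(1, 1, 3))

    return np.clip(out, 0.0, 1.0)
```

With all latents at zero the renderer is the identity, which is a convenient initialization for end-to-end training; in practice an autodiff framework would supply the gradients that NumPy does not.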

Abstract

Reasoning photo retouching has gained significant traction: models must analyze image defects, articulate their reasoning, and execute precise retouching enhancements. However, existing approaches often rely on non-differentiable external software, which creates optimization barriers while suffering from high parameter redundancy and limited generalization. To address these challenges, we propose VeraRetouch, a lightweight, fully differentiable framework for multi-task photo retouching. We employ a 0.5B-parameter Vision-Language Model (VLM) as the central intelligence that formulates retouching plans from instructions and scene semantics. We further develop a fully differentiable Retouch Renderer that replaces external tools, enabling direct end-to-end pixel-level training through decoupled control latents for lighting, global color, and targeted color adjustments. To overcome data scarcity, we introduce AetherRetouch-1M+, the first million-scale dataset for professional retouching, constructed via a new inverse degradation workflow. Finally, we propose DAPO-AE, a reinforcement-learning post-training strategy that enhances autonomous aesthetic cognition. Extensive experiments demonstrate that VeraRetouch achieves state-of-the-art performance across multiple benchmarks while maintaining a significantly smaller footprint, enabling mobile deployment. Our code and models are publicly available at https://github.com/OpenVeraTeam/VeraRetouch.
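The "inverse degradation workflow" for dataset construction can be sketched as follows: starting from a professionally finished photo, apply plausible inverse edits (underexposure, flattened contrast, a color cast) to synthesize the "before" image, yielding a (degraded input, retouched target) training pair. The degradation types and parameter ranges here are assumptions for illustration; the paper's actual workflow is not described in this summary.

```python
import numpy as np

def inverse_degrade(retouched, rng):
    """Synthesize an 'unretouched' input from a finished photo by applying
    inverse degradations. Ranges below are illustrative assumptions."""
    img = retouched.astype(np.float64)
    # Underexpose: random gain below 1.
    img = img * rng.uniform(0.6, 0.95)
    # Flatten contrast toward the image mean.
    mean = img.mean()
    img = mean + (img - mean) * rng.uniform(0.7, 1.0)
    # Mild color cast: per-channel gains near 1.
    img = img * rng.uniform(0.9, 1.1, size=3)
    return np.clip(img, 0.0, 1.0)

def make_pair(retouched, rng):
    """Return a (degraded input, retouched target) training pair."""
    return inverse_degrade(retouched, rng), retouched
```

Because the targets are already professional-quality, the workflow scales to millions of pairs without manual retouching labor, which is presumably what makes a million-scale dataset like AetherRetouch-1M+ feasible.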