AceTone: Bridging Words and Colors for Conditional Image Grading

arXiv cs.CV / 4/2/2026


Key Points

  • AceTone is presented as a new multimodal, unified framework for conditional image color grading that can be driven by both text prompts and reference images.
  • The method reformulates color grading as a generative transformation task that outputs 3D-LUTs, using a VQ-VAE tokenizer to compress each 3×32^3 LUT into 64 discrete tokens while keeping reconstruction error at ΔE<2.
  • The authors introduce the AceTone-800K large-scale dataset and train a vision-language model to predict LUT tokens, then apply reinforcement learning to better match perceptual fidelity and aesthetic preferences.
  • Experiments reportedly show state-of-the-art performance on text-guided and reference-guided grading, including up to a 50% improvement in LPIPS versus prior methods.
  • Human evaluations indicate the generated color styles are visually pleasing and stylistically coherent, positioning AceTone as a step toward language-driven, aesthetics-aligned color grading.
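
To make the 3D-LUT output format concrete, here is a minimal sketch of how a LUT of that shape could be applied to a single color. It assumes a 32×32×32 lattice storing one RGB triple per node and trilinear interpolation between nodes, which is a common convention but is not confirmed by the summary above. The helper names `make_identity_lut` and `apply_lut` are hypothetical and are not taken from the paper.

```python
# Sketch only: an identity 3D-LUT applied with trilinear interpolation.
# make_identity_lut and apply_lut are hypothetical names, not AceTone's API.

def make_identity_lut(n):
    """Return lut[r][g][b] = (r, g, b) scaled to [0, 1] on an n x n x n grid."""
    step = 1.0 / (n - 1)
    return [[[(r * step, g * step, b * step) for b in range(n)]
             for g in range(n)]
            for r in range(n)]

def apply_lut(lut, rgb):
    """Map one RGB triple in [0, 1] through the LUT via trilinear interpolation."""
    n = len(lut)
    # Grid position, lower-corner index, and fractional offset per channel.
    pos = [min(max(c, 0.0), 1.0) * (n - 1) for c in rgb]
    lo = [min(int(p), n - 2) for p in pos]
    frac = [p - i for p, i in zip(pos, lo)]
    out = [0.0, 0.0, 0.0]
    # Blend the 8 lattice nodes surrounding the input color.
    for dr in (0, 1):
        for dg in (0, 1):
            for db in (0, 1):
                w = ((frac[0] if dr else 1 - frac[0]) *
                     (frac[1] if dg else 1 - frac[1]) *
                     (frac[2] if db else 1 - frac[2]))
                node = lut[lo[0] + dr][lo[1] + dg][lo[2] + db]
                for ch in range(3):
                    out[ch] += w * node[ch]
    return tuple(out)

N = 32
lut = make_identity_lut(N)
num_values = 3 * N ** 3  # numbers stored in a 3 x 32^3 LUT
graded = apply_lut(lut, (0.25, 0.5, 0.9))  # identity LUT: output ~= input
```

A grading model in this setting would replace the identity lattice with predicted node colors. The lookup itself stays the same, so the predicted look can be applied to an image of any resolution.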

Abstract

Color affects how we interpret image style and emotion. Previous color grading methods rely on patch-wise recoloring or fixed filter banks, struggling to generalize across creative intents or align with human aesthetic preferences. In this study, we propose AceTone, the first approach that supports multimodal conditioned color grading within a unified framework. AceTone formulates grading as a generative color transformation task, where a model directly produces 3D-LUTs conditioned on text prompts or reference images. We develop a VQ-VAE based tokenizer which compresses a 3×32^3 LUT vector to 64 discrete tokens with ΔE<2 fidelity. We further build a large-scale dataset, AceTone-800K, and train a vision-language model to predict LUT tokens, followed by reinforcement learning to align outputs with perceptual fidelity and aesthetics. Experiments show that AceTone achieves state-of-the-art performance on both text-guided and reference-guided grading tasks, improving LPIPS by up to 50% over existing methods. Human evaluations confirm that AceTone's results are visually pleasing and stylistically coherent, demonstrating a new pathway toward language-driven, aesthetic-aligned color grading.
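
The tokenizer step can be illustrated with a toy version of the discrete bottleneck used in VQ-VAE-style models: each latent vector is replaced by the index of its nearest codebook entry. The abstract does not describe the encoder, the codebook size, or which ΔE formula is used, so everything below is an assumption for illustration: the codebook is a hand-written 4-entry toy, the function names are hypothetical, and ΔE is computed with the CIE76 definition (Euclidean distance in CIELAB).

```python
# Sketch only: nearest-codeword quantization plus a CIE76 color-difference check.
# The toy codebook and all function names are assumptions, not AceTone's code.
import math

def nearest_code(vec, codebook):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sqdist(vec, codebook[i]))

def quantize(latents, codebook):
    """Encode each latent vector as a discrete token id."""
    return [nearest_code(v, codebook) for v in latents]

def dequantize(tokens, codebook):
    """Decode token ids back to their codebook vectors."""
    return [codebook[t] for t in tokens]

def delta_e_cie76(lab_a, lab_b):
    """CIE76 color difference: Euclidean distance between two CIELAB colors."""
    return math.dist(lab_a, lab_b)

# Toy 2-D codebook; a real codebook would be learned during training.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
latents = [(0.1, 0.2), (0.9, 0.1), (0.4, 0.8)]
tokens = quantize(latents, codebook)
recon = dequantize(tokens, codebook)

# Fidelity check on one pair of CIELAB colors (original vs. reconstructed).
err = delta_e_cie76((50.0, 10.0, 10.0), (51.0, 10.0, 11.0))
within_budget = err < 2.0
```

In a setup like the one the abstract describes, a sequence of 64 such token ids would stand in for the full LUT, and a ΔE-style metric would measure how far colors mapped through the decoded LUT drift from colors mapped through the original.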
