Reliable Reasoning in SVG-LLMs via Multi-Task Multi-Reward Reinforcement Learning
arXiv cs.CV / 3/18/2026
📰 News · Models & Research
Key Points
- CTRL-S proposes chain-of-thought reinforcement learning for SVG generation to explicitly expose the model's reasoning during output.
- It introduces SVG-Sophia, a 145k-sample dataset across SVG code refinement, Text-to-SVG, and Image-to-SVG tasks to support structured reasoning.
- The framework uses the GRPO algorithm and a multi-reward objective including DINO, image-text similarity, format, and code-efficiency rewards to guide learning.
- Joint multi-task training improves structural coherence, SVG code quality, and visual fidelity over prior methods.
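The multi-reward objective above can be sketched as a weighted combination of per-sample reward terms, followed by GRPO-style group normalization. This is a minimal illustration, not the paper's implementation: the component functions (`format_reward`, `code_efficiency_reward`), the weights, and the `dino_sim`/`text_sim` inputs are all hypothetical stand-ins for the actual DINO and image-text similarity scores.

```python
# Hedged sketch of a multi-reward objective for SVG generation.
# All reward components and weights here are illustrative assumptions.

def format_reward(svg: str) -> float:
    """Toy well-formedness check: 1.0 if output looks like an SVG document."""
    s = svg.strip()
    return 1.0 if s.startswith("<svg") and s.endswith("</svg>") else 0.0

def code_efficiency_reward(svg: str, max_len: int = 2000) -> float:
    """Toy efficiency proxy: shorter SVG code scores higher."""
    return max(0.0, 1.0 - len(svg) / max_len)

def combined_reward(svg: str, dino_sim: float, text_sim: float,
                    weights=(0.4, 0.3, 0.2, 0.1)) -> float:
    """Weighted sum of four reward terms (weights are illustrative)."""
    w_dino, w_text, w_fmt, w_eff = weights
    return (w_dino * dino_sim
            + w_text * text_sim
            + w_fmt * format_reward(svg)
            + w_eff * code_efficiency_reward(svg))

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: normalize rewards within a sampled group."""
    mu = sum(rewards) / len(rewards)
    sd = (sum((r - mu) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mu) / (sd + 1e-8) for r in rewards]

svg = '<svg xmlns="http://www.w3.org/2000/svg"><circle r="10"/></svg>'
reward = combined_reward(svg, dino_sim=0.8, text_sim=0.7)
advantages = group_advantages([0.2, reward, 0.5])
```

In GRPO the policy is updated with these group-relative advantages rather than a learned value function, so each completion in a sampled group is scored against its siblings.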