ScribbleEdit: Synthetic Data for Image Editing with Scribbles and Text

arXiv cs.CV / 5/5/2026


Key Points

  • The paper argues that current image editing models struggle to support precise, intuitive control because users must specify both exact spatial layouts and detailed semantics, and neither natural language nor freehand scribbles alone can fully provide both.
  • It introduces ScribbleEdit, a large-scale synthetic dataset that pairs human-drawn scribbles with VLM-generated text instructions to better train models to interpret both modalities together.
  • The dataset is built via a synthetic pipeline that generates source–target image pairs using inpainting, then associates them with scribbles and text instructions.
  • Experiments show that off-the-shelf unified multimodal image editing models perform poorly with abstract scribbles, but fine-tuning on ScribbleEdit improves spatial alignment and semantic consistency.
  • The work evaluates and fine-tunes both diffusion-based and autoregressive unified multimodal image editing models using the proposed dataset.
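
The dataset-construction pipeline described above could be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the function names, record fields, and string placeholders are all assumptions, and the stubs stand in for a real inpainting model, scribble collection, and a VLM captioner.

```python
from dataclasses import dataclass

@dataclass
class EditSample:
    """One ScribbleEdit-style training record (field names are assumptions)."""
    source_image: str   # identifier of the original image
    target_image: str   # inpainted edit result
    scribble: str       # freehand scribble marking the rough edit region
    instruction: str    # VLM-generated text describing the semantic change

def inpaint_edit(source_image: str, region: str) -> str:
    """Stub for the inpainting model that synthesizes the edited target image."""
    return f"{source_image}_edited_{region}"

def collect_scribble(region: str) -> str:
    """Stub for a human-drawn scribble roughly covering the edit region."""
    return f"scribble_over_{region}"

def caption_edit(source_image: str, target_image: str) -> str:
    """Stub for a VLM that writes a text instruction for the source->target edit."""
    return f"describe change: {source_image} -> {target_image}"

def build_sample(source_image: str, region: str) -> EditSample:
    # 1) generate the source-target pair via inpainting
    target = inpaint_edit(source_image, region)
    # 2) pair the edit with a scribble giving spatial extent
    scribble = collect_scribble(region)
    # 3) pair it with a text instruction giving the semantics
    instruction = caption_edit(source_image, target)
    return EditSample(source_image, target, scribble, instruction)

sample = build_sample("img_001", "dog")
```

The point of the record structure is that each example carries both modalities at once, so a model fine-tuned on it must learn to resolve the scribble's spatial cue against the instruction's semantic cue.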

Abstract

Recent progress in generative models has significantly advanced image editing capabilities, yet precise and intuitive user control remains difficult. Specifically, users often struggle to communicate both exact spatial layouts and specific semantic details simultaneously. While natural language instructions effectively convey high-level semantics like texture and color, they lack spatial specificity. Conversely, freehand scribbles provide rough spatial boundaries but cannot express detailed visual attributes. Consequently, achieving precise control requires combining both modalities. However, existing models struggle to jointly interpret abstract scribbles alongside text due to a lack of specialized training data. In this work, we introduce ScribbleEdit, a large-scale synthetic dataset designed to bridge this gap by combining natural language instructions with freehand scribble inputs for more accurate, controllable edits. We construct this dataset through a synthetic pipeline that automatically generates source–target image pairs via inpainting, which are then paired with human-drawn scribbles and VLM-generated text instructions. Using ScribbleEdit, we evaluate and fine-tune both diffusion-based and autoregressive unified multimodal image editing models. Our experiments reveal that while off-the-shelf models struggle with abstract scribble inputs, fine-tuning on our synthetic dataset significantly improves their ability to generate spatially aligned and semantically consistent edits.