WeEdit: A Dataset, Benchmark and Glyph-Guided Framework for Text-centric Image Editing
arXiv cs.CV / 3/13/2026
Key Points
- WeEdit presents a scalable data construction pipeline, two benchmarks, and a tailored two-stage training strategy for text-centric image editing.
- It introduces an HTML-based automatic editing pipeline that generates about 330K training pairs across 15 languages, enabling multilingual text editing in images.
- The framework uses glyph-guided supervised fine-tuning to inject explicit spatial and content priors, followed by multi-objective reinforcement learning to improve instruction adherence, text clarity, and background preservation.
- The approach provides standardized bilingual and multilingual benchmarks for comprehensive evaluation of text-centric image editing models.
- Experiments show that WeEdit outperforms previous open-source models across diverse editing operations.
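The HTML-based data construction described above can be sketched roughly as follows. This is a minimal illustration, not the paper's actual pipeline: the template markup, record field names, and instruction wording are all assumptions. A real pipeline would render the source and target HTML to images (e.g., with a headless browser) to obtain the pixel-level training pair.

```python
# Hedged sketch of an HTML-template-based generator of text-editing
# training pairs, in the spirit of WeEdit's automatic data pipeline.
# All names below (PAGE, make_edit_pair, field names) are illustrative.
from string import Template

# A minimal HTML page template; real data would use varied layouts,
# fonts, and backgrounds across the 15 supported languages.
PAGE = Template(
    '<html><body style="font-family:sans-serif">'
    '<p style="font-size:${size}px">${text}</p>'
    "</body></html>"
)

def make_edit_pair(original: str, edited: str, lang: str, size: int = 32) -> dict:
    """Build one training record: source/target HTML plus an edit instruction."""
    return {
        "lang": lang,
        "source_html": PAGE.substitute(text=original, size=size),
        "target_html": PAGE.substitute(text=edited, size=size),
        # Instruction format is a guess at what a text-editing prompt looks like.
        "instruction": f'Replace the text "{original}" with "{edited}".',
    }

pair = make_edit_pair("Grand Opening", "Closing Sale", lang="en")
print(pair["instruction"])
```

Because the edit is applied inside the markup before rendering, the source and target pages differ only in the edited text span, which is what makes background preservation easy to supervise.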