WeEdit: A Dataset, Benchmark and Glyph-Guided Framework for Text-centric Image Editing
arXiv cs.CV / 3/13/2026
📰 News · Models & Research
Key Points
- WeEdit presents a scalable data construction pipeline, two benchmarks, and a tailored two-stage training strategy for text-centric image editing.
- It introduces an HTML-based automatic editing pipeline that generates about 330K training pairs across 15 languages, enabling multilingual text editing in images.
- The framework uses glyph-guided supervised fine-tuning to inject explicit spatial and content priors, followed by multi-objective reinforcement learning to improve instruction adherence, text clarity, and background preservation.
- The approach provides standardized bilingual and multilingual benchmarks for comprehensive evaluation of text-centric image editing models.
- Experiments show that WeEdit outperforms previous open-source models across diverse editing operations.
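The HTML-based pair construction described above can be pictured as filling the same page template twice, once with the source string and once with the target string, so the two rendered pages differ only in the text region to be edited. A minimal sketch, assuming a simple templating approach; the template, function names, and instruction format are illustrative, not from the paper:

```python
# Hypothetical sketch of an HTML-based editing-pair generator: the same
# template is filled with source and target strings, so the two pages
# differ only in the text region that the edit instruction targets.

TEMPLATE = """<html><body>
  <div class="banner" style="font-family:{font};">{text}</div>
</body></html>"""

def make_edit_pair(src_text: str, tgt_text: str, font: str = "sans-serif"):
    """Return (source_html, target_html, instruction) for one training pair."""
    source = TEMPLATE.format(font=font, text=src_text)
    target = TEMPLATE.format(font=font, text=tgt_text)
    instruction = f'Replace the text "{src_text}" with "{tgt_text}".'
    return source, target, instruction

# Multilingual coverage falls out of swapping in strings from different
# scripts before rendering the pages to images.
pairs = [
    make_edit_pair("OPEN", "CLOSED"),
    make_edit_pair("ようこそ", "ありがとう"),  # Japanese source/target pair
]
```

In a full pipeline the two HTML pages would then be rendered to pixel-aligned images (e.g. with a headless browser), yielding image pairs whose only difference is the edited text.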
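The multi-objective reinforcement-learning stage can be pictured as scalarizing the three criteria named above into a single reward for policy optimization. A minimal sketch with hypothetical weights and score functions; the paper's actual reward design is not specified here:

```python
# Hypothetical sketch: combine the three reward signals mentioned in the
# summary (instruction adherence, text clarity, background preservation)
# into one scalar. The weights are illustrative assumptions.

def combined_reward(adherence: float, clarity: float, background: float,
                    weights=(0.4, 0.4, 0.2)) -> float:
    """Weighted sum of per-criterion scores, each assumed to lie in [0, 1]."""
    w_a, w_c, w_b = weights
    return w_a * adherence + w_c * clarity + w_b * background

# A sample where the edit followed the instruction and preserved the
# background, but rendered slightly blurry text:
r = combined_reward(adherence=0.9, clarity=0.6, background=0.95)
```

Weighting the objectives separately lets training trade off instruction adherence against text legibility and background fidelity rather than optimizing a single proxy.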