IMTBench: A Multi-Scenario Cross-Modal Collaborative Evaluation Benchmark for In-Image Machine Translation
arXiv cs.CV / 3/12/2026
📰 News · Models & Research
Key Points
- IMTBench introduces a new benchmark for end-to-end in-image machine translation, featuring 2,500 samples across four scenarios and nine languages.
- It evaluates translation quality, background preservation, overall image quality, and a cross-modal alignment score that measures consistency between the translated text and the rendered image (see the scoring sketch after this list).
- The study benchmarks commercial cascade systems alongside closed- and open-source multi-modal models, revealing large performance gaps across scenarios and languages, especially for natural scenes and low-resource languages.
- The authors aim to standardize benchmarking to accelerate progress in end-to-end in-image machine translation.
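The key points describe four per-sample evaluation dimensions. The sketch below is a minimal illustration of how such scores could be aggregated and broken down by scenario; the field names, score ranges, scenario labels, and unweighted averaging are assumptions made for illustration, not the benchmark's actual metrics or scoring scheme.

```python
# Hypothetical sketch: aggregating the four evaluation dimensions described
# above into per-sample and per-scenario scores. Names and weighting are
# illustrative assumptions, not IMTBench's definitions.
from dataclasses import dataclass
from statistics import mean


@dataclass
class IMTSample:
    scenario: str                    # hypothetical scenario label, e.g. "natural_scene"
    language: str                    # target language code
    translation_quality: float       # translation accuracy, assumed in [0, 1]
    background_preservation: float   # how well non-text regions are kept, assumed in [0, 1]
    image_quality: float             # overall rendered-image quality, assumed in [0, 1]
    cross_modal_alignment: float     # consistency of rendered text with the translation

    def overall(self) -> float:
        # Simple unweighted mean; the benchmark may weight dimensions differently.
        return mean([
            self.translation_quality,
            self.background_preservation,
            self.image_quality,
            self.cross_modal_alignment,
        ])


def scenario_breakdown(samples: list[IMTSample]) -> dict[str, float]:
    """Average overall score per scenario, to surface gaps such as the
    weaker natural-scene results mentioned in the key points."""
    by_scenario: dict[str, list[float]] = {}
    for s in samples:
        by_scenario.setdefault(s.scenario, []).append(s.overall())
    return {scenario: mean(scores) for scenario, scores in by_scenario.items()}
```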