Hi everyone,
I’m looking for benchmarks or leaderboards specifically focused on image description / image captioning quality with LLMs or VLMs.
Most of the benchmarks I find focus on general multimodal reasoning, VQA, OCR, or broad vision-language performance, but what I really want is something that evaluates how well a model describes an image in natural language.
Ideally, I’m looking for things like:
- benchmark datasets for image description/captioning,
- leaderboards comparing models on this task,
- evaluation metrics commonly used for this scenario,
- and, if possible, benchmarks that are relevant to newer multimodal LLMs rather than only traditional captioning models.
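For context, by "evaluation metrics" I mean automatic reference-based scores like BLEU/CIDEr-style n-gram matching (real evaluations usually use a library such as pycocoevalcap, but this pure-Python sketch of clipped n-gram precision with a brevity penalty shows the kind of comparison I have in mind; it is a simplified illustration, not a faithful BLEU implementation):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu_sketch(candidate, references, max_n=4):
    """BLEU-style score: geometric mean of clipped n-gram precisions
    times a brevity penalty. Simplified for illustration only."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        # Clip candidate n-gram counts by the max count in any reference.
        max_ref_counts = Counter()
        for ref in references:
            for g, c in Counter(ngrams(ref, n)).items():
                max_ref_counts[g] = max(max_ref_counts[g], c)
        clipped = sum(min(c, max_ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(clipped / total)
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty against the shortest reference (simplified choice).
    ref_len = min(len(r) for r in references)
    bp = 1.0 if len(candidate) >= ref_len else math.exp(1 - ref_len / max(len(candidate), 1))
    return bp * geo_mean

# A caption identical to a reference scores 1.0; partial overlap scores lower.
cand = "a dog runs across the grassy park".split()
refs = [["a", "dog", "runs", "across", "the", "grassy", "park"]]
print(bleu_sketch(cand, refs))  # → 1.0
```

I know n-gram metrics correlate poorly with human judgments of description quality, which is partly why I'm hoping for benchmarks built around newer multimodal models.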
My use case is evaluating models for generating spoken descriptions of images, so I’m especially interested in benchmarks that reflect useful, natural, and accurate scene descriptions.
Does anyone know good references, papers, leaderboards, or datasets for this?
I need this for my research ^-^, thanks!