
Are there any benchmarks or leaderboards for image description with LLMs?

Reddit r/LocalLLaMA / 3/12/2026


Key Points

  • The post seeks benchmarks or leaderboards focused on image captioning quality when using LLMs or VLMs, not general multimodal tasks.
  • It aims to find benchmark datasets, leaderboards, and metrics that evaluate natural language image descriptions.
  • It wants benchmarks relevant to newer multimodal LLMs rather than only traditional captioning models, especially ones suited to spoken-description use cases.
  • It invites references, papers, and datasets from the community for research purposes.

Hi everyone,

I’m looking for benchmarks or leaderboards specifically focused on image description / image captioning quality with LLMs or VLMs.

Most of the benchmarks I find are about general multimodal reasoning, VQA, OCR, or broad vision-language performance, but what I really want is something that evaluates how well a model describes an image in natural language.

Ideally, I’m looking for things like:

  • benchmark datasets for image description/captioning,
  • leaderboards comparing models on this task,
  • evaluation metrics commonly used for this scenario (a quick sketch follows this list),
  • and, if possible, benchmarks that are relevant to newer multimodal LLMs rather than only traditional captioning models.
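
To make the metrics item concrete: by "commonly used" I mean the classic reference-based n-gram scores from the COCO captioning era (BLEU, METEOR, ROUGE-L, CIDEr, SPICE). Here is a minimal sketch of how one of them is computed, using NLTK's BLEU; the captions are placeholders I made up:

```python
# Minimal sketch of a classic reference-based captioning metric (BLEU via NLTK).
# All captions below are made-up placeholders, not from any real benchmark.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Reference captions, e.g. the multiple human captions per image in COCO-style datasets
references = [
    "a brown dog runs across a grassy field".split(),
    "a dog running through the grass".split(),
]

# Model-generated description to score against the references
candidate = "a dog is running across a green field".split()

# Smoothing avoids zero scores when a higher-order n-gram has no overlap,
# which happens constantly with short captions
score = sentence_bleu(
    references, candidate,
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU: {score:.3f}")
```

My impression is that these n-gram metrics correlate poorly with human judgment on the long, free-form descriptions modern VLMs produce, which is exactly why I'm hoping newer benchmarks or leaderboards exist.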

My use case is evaluating models for generating spoken descriptions of images, so I’m especially interested in benchmarks that reflect useful, natural, and accurate scene descriptions.

Does anyone know good references, papers, leaderboards, or datasets for this?

I need this for my research ^-^, thanks!

submitted by /u/Blue_Horizon97