Appear2Meaning: A Cross-Cultural Benchmark for Structured Cultural Metadata Inference from Images

arXiv cs.CV / 4/9/2026


Key Points

  • The paper introduces Appear2Meaning, a cross-cultural benchmark aimed at inferring structured cultural metadata (such as creator, origin, and period) from images rather than producing only free-form captions.
  • It evaluates vision-language models using an LLM-as-Judge approach that scores semantic alignment with reference annotations (a minimal sketch of such a judge call follows this list).
  • Performance is assessed with exact-match, partial-match, and attribute-level accuracy, revealing that models often rely on fragmented visual signals.
  • Results show significant variation across cultural regions and metadata types, with predictions that are inconsistent and only weakly grounded.
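A minimal sketch of what an LLM-as-Judge call for one metadata attribute could look like, under assumptions: the `judge_model` callable, the prompt wording, and the three-level verdict scheme are illustrative and not taken from the paper.

```python
# Hypothetical sketch of an LLM-as-Judge call: the judge compares one predicted
# metadata value against the reference annotation and returns a match level.
# `judge_model`, the prompt text, and the verdict labels are assumptions.
import json

JUDGE_PROMPT = """You are grading a model's prediction of a cultural-metadata attribute.
Attribute: {attribute}
Reference annotation: {reference}
Model prediction: {prediction}
Answer with a JSON object {{"verdict": "exact" | "partial" | "none"}} where
"exact" means the prediction is semantically equivalent to the reference,
"partial" means it overlaps but is incomplete or less specific, and
"none" means it does not match."""

def judge_attribute(judge_model, attribute: str, reference: str, prediction: str) -> str:
    """Ask a judge model (any callable returning a text completion) for a verdict."""
    prompt = JUDGE_PROMPT.format(attribute=attribute, reference=reference, prediction=prediction)
    raw = judge_model(prompt)
    try:
        return json.loads(raw)["verdict"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return "none"  # treat unparseable judge output as no match
```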

Abstract

Recent advances in vision-language models (VLMs) have improved image captioning for cultural heritage. However, inferring structured cultural metadata (e.g., creator, origin, period) from visual input remains underexplored. We introduce a multi-category, cross-cultural benchmark for this task and evaluate VLMs using an LLM-as-Judge framework that measures semantic alignment with reference annotations. To assess cultural reasoning, we report exact-match, partial-match, and attribute-level accuracy across cultural regions. Results show that models capture fragmented signals and exhibit substantial performance variation across cultures and metadata types, leading to inconsistent and weakly grounded predictions. These findings highlight the limitations of current VLMs in structured cultural metadata inference beyond visual perception.
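As a rough illustration of the reported metrics, the sketch below aggregates per-attribute judge verdicts into exact-match, partial-match, and attribute-level accuracy grouped by cultural region. The record layout and the convention that partial-match accuracy counts both exact and partial verdicts are assumptions for illustration, not the paper's definitions.

```python
# Minimal sketch: aggregate per-attribute judge verdicts into exact-match and
# partial-match accuracy per (region, attribute). Record layout is assumed.
from collections import defaultdict
from typing import Iterable

def aggregate_scores(records: Iterable[dict]) -> dict:
    """Each record: {"region": str, "attribute": str, "verdict": "exact"|"partial"|"none"}."""
    per_region: dict = defaultdict(lambda: defaultdict(list))
    for rec in records:
        per_region[rec["region"]][rec["attribute"]].append(rec["verdict"])

    report: dict = {}
    for region, attrs in per_region.items():
        report[region] = {}
        for attribute, verdicts in attrs.items():
            n = len(verdicts)
            exact = sum(v == "exact" for v in verdicts) / n
            partial = sum(v in ("exact", "partial") for v in verdicts) / n
            report[region][attribute] = {"exact_match": exact, "partial_match": partial}
    return report

# Toy usage with made-up verdicts
if __name__ == "__main__":
    toy = [
        {"region": "East Asia", "attribute": "period", "verdict": "exact"},
        {"region": "East Asia", "attribute": "period", "verdict": "partial"},
        {"region": "West Africa", "attribute": "creator", "verdict": "none"},
    ]
    print(aggregate_scores(toy))
```

Reporting the breakdown per region and per attribute, rather than a single pooled score, is what surfaces the cross-cultural and cross-metadata variation the paper highlights.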