VLAgeBench: Benchmarking Large Vision-Language Models for Zero-Shot Human Age Estimation

arXiv cs.AI / 3/30/2026


Key Points

  • The study evaluates large vision-language models (GPT-4o, Claude 3.5 Sonnet, and LLaMA 3.2 Vision) on zero-shot age estimation from facial images, measuring performance without any fine-tuning on two benchmarks, UTKFace and FG-NET.
  • Using eight metrics (MAE, MSE, RMSE, MAPE, MBE, R², CCC, and ±5-year accuracy), it shows that general-purpose LVLMs can deliver results competitive with conventional domain-specific training.
  • Performance gaps tied to image quality and demographic subgroups were observed, pointing to the need for fairness-aware multimodal inference in age estimation as well.
  • Open problems remain (prompt sensitivity, interpretability, computational cost, and demographic fairness), but the work provides a reproducible benchmark as a foundation for real-world applications in forensic science, healthcare monitoring, and HCI.

Abstract

Human age estimation from facial images represents a challenging computer vision task with significant applications in biometrics, healthcare, and human-computer interaction. While traditional deep learning approaches require extensive labeled datasets and domain-specific training, recent advances in large vision-language models (LVLMs) offer the potential for zero-shot age estimation. This study presents a comprehensive zero-shot evaluation of state-of-the-art LVLMs for facial age estimation, a task traditionally dominated by domain-specific convolutional networks and supervised learning. We assess the performance of GPT-4o, Claude 3.5 Sonnet, and LLaMA 3.2 Vision on two benchmark datasets, UTKFace and FG-NET, without any fine-tuning or task-specific adaptation. Using eight evaluation metrics (MAE, MSE, RMSE, MAPE, MBE, R², CCC, and ±5-year accuracy), we demonstrate that general-purpose LVLMs can deliver competitive performance in strict zero-shot settings. Our findings highlight the emergent capabilities of LVLMs for biometric age estimation, while also revealing performance disparities linked to image quality and demographic subgroups, underscoring the need for fairness-aware multimodal inference. This work introduces a reproducible benchmark and positions LVLMs as promising tools for real-world applications in forensic science, healthcare monitoring, and human-computer interaction, while highlighting remaining challenges related to prompt sensitivity, interpretability, computational cost, and demographic fairness.
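Most of the eight reported metrics are standard regression scores; the less common ones are CCC (Lin's concordance correlation coefficient) and ±5-year accuracy (the fraction of predictions within five years of the true age). A minimal NumPy sketch of computing all eight follows; the function name and exact formulations are illustrative assumptions, not the paper's released code:

```python
import numpy as np

def age_metrics(y_true, y_pred):
    """Compute the eight age-estimation metrics used in the benchmark.

    Illustrative implementation (not the paper's code). Assumes ages > 0.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true

    mae = np.mean(np.abs(err))                      # mean absolute error
    mse = np.mean(err ** 2)                         # mean squared error
    rmse = np.sqrt(mse)                             # root mean squared error
    mape = np.mean(np.abs(err) / y_true) * 100.0    # mean absolute % error
    mbe = np.mean(err)                              # mean bias error: sign shows
                                                    # systematic over/under-estimation
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)

    # Concordance correlation coefficient (Lin, 1989):
    # CCC = 2*cov(t, p) / (var(t) + var(p) + (mean(t) - mean(p))^2)
    mu_t, mu_p = y_true.mean(), y_pred.mean()
    cov = np.mean((y_true - mu_t) * (y_pred - mu_p))
    ccc = 2.0 * cov / (y_true.var() + y_pred.var() + (mu_t - mu_p) ** 2)

    acc5 = np.mean(np.abs(err) <= 5.0) * 100.0      # % of predictions within ±5 years
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "MAPE": mape,
            "MBE": mbe, "R2": r2, "CCC": ccc, "ACC5": acc5}
```

Unlike Pearson's r, CCC penalizes both scale and location shifts, so a model that is well correlated but systematically over-ages faces scores lower; MBE isolates that bias direction explicitly.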