RealBirdID: Benchmarking Bird Species Identification in the Era of MLLMs

arXiv cs.CV / 3/31/2026


Key Points

  • RealBirdID proposes a benchmark that evaluates the "answer or abstain" decision for bird species identification in the wild; when a model abstains, it must give an evidence-based rationale such as "requires vocalization," "low quality image," or "view obstructed."
  • Even multimodal LLMs with strong generation and reasoning abilities achieve low species identification accuracy on the benchmark's answerable cases (under 13% for MLLMs), showing the task remains hard in practice.
  • More accurate models are not necessarily better calibrated to abstain on unanswerable examples, and in many cases the rationales they give when they do abstain are incorrect.
  • For each genus, the dataset provides a validation split of unanswerable examples with labeled rationales, paired with answerable examples, giving a concrete measurement framework for abstention-aware fine-tuning and progress tracking.

Abstract

Fine-grained bird species identification in the wild is frequently unanswerable from a single image: key cues may be non-visual (e.g., vocalization), or obscured due to occlusion, camera angle, or low resolution. Yet today's multimodal systems are typically judged on answerable, in-schema cases, encouraging confident guesses rather than principled abstention. We propose the RealBirdID benchmark: given an image of a bird, a system should either answer with a species or abstain with a concrete, evidence-based rationale: "requires vocalization," "low quality image," or "view obstructed." For each genus, the dataset includes a validation split composed of curated unanswerable examples with labeled rationales, paired with a companion set of clearly answerable instances. We find that (1) species identification on the answerable set is challenging for a variety of open-source and proprietary models (less than 13% accuracy for MLLMs including GPT-5 and Gemini-2.5 Pro), (2) models with greater classification ability are not necessarily more calibrated to abstain from unanswerable examples, and (3) MLLMs generally fail to provide correct reasons even when they do abstain. RealBirdID establishes a focused target for abstention-aware fine-grained recognition and a recipe for measuring progress.
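The abstract implies three separate metrics: accuracy on answerable cases, abstention behavior on unanswerable cases, and rationale correctness among abstentions. The following is a minimal sketch of how such a scorer could look; the `Example`/`Prediction` structures and field names are illustrative assumptions, not the paper's actual evaluation code.

```python
# Hypothetical sketch of a RealBirdID-style scoring protocol.
# Data structures and metric names are assumptions for illustration.
from dataclasses import dataclass
from typing import Optional

# The three abstention rationales named in the paper.
ABSTAIN_RATIONALES = {"requires vocalization", "low quality image", "view obstructed"}

@dataclass
class Example:
    image_id: str
    answerable: bool
    species: Optional[str] = None    # gold species label if answerable
    rationale: Optional[str] = None  # gold abstention rationale otherwise

@dataclass
class Prediction:
    species: Optional[str] = None    # None means the model abstained
    rationale: Optional[str] = None  # rationale given when abstaining

def score(examples, predictions):
    """Compute (1) accuracy on answerable cases, (2) abstention rate on
    unanswerable cases, (3) rationale correctness among those abstentions."""
    acc = abst = rat = n_ans = n_unans = n_abst = 0
    for ex, pr in zip(examples, predictions):
        if ex.answerable:
            n_ans += 1
            acc += pr.species == ex.species
        else:
            n_unans += 1
            if pr.species is None:  # model abstained
                n_abst += 1
                abst += 1
                rat += pr.rationale == ex.rationale
    return {
        "answerable_accuracy": acc / n_ans if n_ans else 0.0,
        "abstention_recall": abst / n_unans if n_unans else 0.0,
        "rationale_accuracy": rat / n_abst if n_abst else 0.0,
    }
```

Separating the three numbers matters because, per finding (2), a model can score well on `answerable_accuracy` while scoring poorly on `abstention_recall`, and per finding (3), abstaining does not imply a correct rationale.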
