Benchmark for Assessing Olfactory Perception of Large Language Models

arXiv cs.AI / 4/2/2026


Key Points

  • The proposed Olfactory Perception (OP) benchmark contains 1,010 questions across eight categories and evaluates whether LLMs can reason about smell (olfaction).
  • Tasks span odor classification, primary-descriptor identification, intensity and pleasantness judgments, mixture similarity, olfactory-receptor activation estimation, and smell identification from real-world odor sources.
  • Across 21 model configurations, compound-name prompts consistently outperformed isomeric SMILES representations, with gains of +2.4 to +18.9 percentage points (mean ≈ +7), suggesting that current models access olfactory knowledge through lexical associations rather than structural molecular reasoning.
  • The best model reached 64.4% overall accuracy, leaving a substantial gap in olfactory reasoning; in a partial evaluation across 21 languages, aggregating predictions cross-lingually proved effective, with AUROC = 0.86 for the best language ensemble.
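
The paper contrasts two prompt formats for the same molecule. As an illustration only (the benchmark's exact prompt wording is not given in this summary, so the question text and helper names below are assumptions), the two representations might be constructed like this, using vanillin as an example:

```python
# Illustrative sketch: the same benchmark question posed in the two prompt
# formats described above. The question wording is hypothetical.
QUESTION = ("Which primary odor descriptor best matches this compound: "
            "sweet, pungent, or fishy?")

def name_prompt(name: str) -> str:
    """Prompt that identifies the molecule by its compound name."""
    return f"Compound: {name}\n{QUESTION}"

def smiles_prompt(smiles: str) -> str:
    """Prompt that identifies the molecule by its isomeric SMILES string."""
    return f"Compound (isomeric SMILES): {smiles}\n{QUESTION}"

print(name_prompt("vanillin"))
print(smiles_prompt("COc1cc(C=O)ccc1O"))  # SMILES for vanillin
```

The finding that the first format wins by +2.4 to +18.9 points implies the models are retrieving odor facts keyed to the name string rather than inferring them from the structure the SMILES encodes.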

Abstract

Here we introduce the Olfactory Perception (OP) benchmark, designed to assess the capability of large language models (LLMs) to reason about smell. The benchmark contains 1,010 questions across eight task categories spanning odor classification, odor primary descriptor identification, intensity and pleasantness judgments, multi-descriptor prediction, mixture similarity, olfactory receptor activation, and smell identification from real-world odor sources. Each question is presented in two prompt formats, compound names and isomeric SMILES, to evaluate the effect of molecular representations. Evaluating 21 model configurations across major model families, we find that compound-name prompts consistently outperform isomeric SMILES, with gains ranging from +2.4 to +18.9 percentage points (mean ≈ +7 points), suggesting current LLMs access olfactory knowledge primarily through lexical associations rather than structural molecular reasoning. The best-performing model reaches 64.4% overall accuracy, which highlights both emerging capabilities and substantial remaining gaps in olfactory reasoning. We further evaluate a subset of the OP benchmark across 21 languages and find that aggregating predictions across languages improves olfactory prediction, with AUROC = 0.86 for the best-performing language ensemble. These results suggest LLMs should be able to handle olfactory information, not just visual or auditory input.
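
The cross-lingual ensembling result can be sketched as follows. This is a minimal illustration, not the paper's method: the per-language scores and labels are invented, and the aggregation is a simple mean of per-language probabilities, scored with a rank-based AUROC.

```python
from statistics import mean

def auroc(labels, scores):
    """AUROC via the Mann-Whitney U statistic (ties get average ranks)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank for the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    pos_ranks = [r for r, y in zip(ranks, labels) if y == 1]
    n_pos, n_neg = len(pos_ranks), len(labels) - len(pos_ranks)
    return (sum(pos_ranks) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Hypothetical per-language probabilities for the same five binary questions,
# e.g. "does this compound smell sweet?" asked in each language.
per_language = {
    "en": [0.9, 0.4, 0.8, 0.3, 0.6],
    "ja": [0.7, 0.5, 0.9, 0.2, 0.4],
    "de": [0.8, 0.3, 0.7, 0.4, 0.5],
}
labels = [1, 0, 1, 0, 1]  # invented ground truth

# Language ensemble: average each question's score across languages.
ensemble = [mean(vals) for vals in zip(*per_language.values())]
print(auroc(labels, ensemble))  # → 1.0 on this toy data
```

The intuition the paper reports is that individual languages make partly uncorrelated errors, so averaging their predictions cancels some noise and lifts AUROC above any single language.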