SommBench: Assessing Sommelier Expertise of Language Models
arXiv cs.CL / 3/13/2026
Key Points
- SommBench is a multilingual benchmark that evaluates the sommelier expertise of language models, aiming to separate wine knowledge from general language ability.
- It comprises three tasks: Wine Theory Question Answering (WTQA, 1,024 questions), Wine Feature Completion (WFC, 1,000 examples), and Food-Wine Pairing (FWP, 1,000 instances), with data collected in English, Slovak, Swedish, Finnish, German, Danish, Italian, and Spanish.
- The benchmark was developed with input from a professional sommelier and native speakers to ensure realism and linguistic coverage, allowing cross-language comparison of wine expertise and grounding.
- Results show strong performance on WTQA for some models (up to 97% accuracy for closed-weights models) but substantially lower performance on WFC (peak accuracy around 65%) and weak performance on food-wine pairing (Matthews correlation coefficient, MCC, between 0 and 0.39), highlighting gaps in sensory-grounded reasoning.
- SommBench is publicly available on GitHub, with reported results for models like Gemini 2.5, GPT-OSS, and Qwen 3, establishing it as a challenging benchmark for evaluating sommelier-style reasoning in language models.
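
For readers unfamiliar with the MCC figure quoted for food-wine pairing, the following is a minimal sketch of how accuracy and the Matthews correlation coefficient differ on a binary task. It assumes, purely for illustration, that each pairing is labeled 1 (good match) or 0 (poor match); the summary does not state the actual FWP label format, and the function names and toy labels below are hypothetical, not taken from the SommBench repository.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the gold label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def matthews_corrcoef(y_true, y_pred):
    """Binary MCC: (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).

    Ranges from -1 (total disagreement) through 0 (no better than chance)
    to 1 (perfect prediction). Returns 0.0 when the denominator is zero,
    a common convention for degenerate confusion matrices.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Toy labels (hypothetical, not benchmark data): 1 = good pairing, 0 = poor pairing.
gold = [1, 0, 1, 0, 1, 0, 1, 0]
model_a = [1, 0, 1, 0, 1, 1, 0, 0]   # mostly right, two mistakes
always_yes = [1] * len(gold)         # predicts "good pairing" for everything

mcc_a = matthews_corrcoef(gold, model_a)
mcc_yes = matthews_corrcoef(gold, always_yes)
acc_yes = accuracy(gold, always_yes)
```

On these toy labels, a model that always answers "good pairing" still reaches 50% accuracy but scores an MCC of 0, which is why an MCC near 0 indicates predictions that carry little information about the true pairing quality.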




