FoodSense: A Multisensory Food Dataset and Benchmark for Predicting Taste, Smell, Texture, and Sound from Images

arXiv cs.CV / 4/17/2026


Key Points

  • The paper introduces FoodSense, a human-annotated multisensory food dataset designed for predicting taste, smell, texture, and sound from images, rather than only supporting recognition tasks.
  • FoodSense covers 2,987 unique food images with 66,842 participant-image pairs, providing 1–5 numeric ratings plus free-text descriptors for four sensory dimensions (a hypothetical record layout is sketched after this list).
  • It also adds image-grounded reasoning traces by using a large language model to generate visual justifications conditioned on the image and the sensory annotations, enabling both prediction and explanation.
  • The authors train FoodSense-VL, a vision-language benchmark model that outputs multisensory ratings and grounded explanations directly from food images.
  • The work argues that common evaluation metrics are often inadequate for assessing visual inference of multisensory experience, and positions the approach as a bridge between cognitive science and multimodal instruction tuning.
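
The dataset statistics above map naturally onto a per-annotation record. The following Python sketch shows one plausible way such a record could be structured and validated; the class and field names are illustrative assumptions, not the schema released with FoodSense.

    from dataclasses import dataclass
    from typing import Dict

    # The four sensory dimensions annotated in FoodSense.
    SENSES = ("taste", "smell", "texture", "sound")


    @dataclass
    class SensoryAnnotation:
        """One participant-image pair: a 1-5 rating and a free-text
        descriptor string for each of the four sensory dimensions."""
        image_id: str                # one of the 2,987 unique food images
        participant_id: str          # one of the study participants
        ratings: Dict[str, int]      # e.g. {"taste": 4, "smell": 3, ...}
        descriptors: Dict[str, str]  # e.g. {"texture": "crispy, flaky", ...}

        def __post_init__(self) -> None:
            # Enforce the 1-5 rating scale for every sensory dimension.
            for sense in SENSES:
                rating = self.ratings.get(sense)
                if rating is None or not 1 <= rating <= 5:
                    raise ValueError(f"{sense} rating must be in 1-5, got {rating!r}")


    # Example of a single, hypothetical annotation record.
    example = SensoryAnnotation(
        image_id="img_00042",
        participant_id="p_0137",
        ratings={"taste": 4, "smell": 3, "texture": 5, "sound": 2},
        descriptors={"taste": "sweet, buttery", "smell": "caramel",
                     "texture": "crispy, flaky", "sound": "soft crunch"},
    )

Keeping ratings and descriptors keyed by sense (rather than as separate flat columns) makes it easy to iterate over all four dimensions uniformly, which is how the numbers and free text are described as paired in the paper.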

Abstract

Humans routinely infer taste, smell, texture, and even sound from food images, a phenomenon well studied in cognitive science. However, prior vision-language research on food has focused primarily on recognition tasks such as meal identification, ingredient detection, and nutrition estimation. Image-based prediction of multisensory experience remains largely unexplored. We introduce FoodSense, a human-annotated dataset for cross-sensory inference containing 66,842 participant-image pairs across 2,987 unique food images. Each pair includes numeric ratings (1–5) and free-text descriptors for four sensory dimensions: taste, smell, texture, and sound. To enable models to both predict and explain sensory expectations, we expand short human annotations into image-grounded reasoning traces: a large language model generates visual justifications conditioned on the image, ratings, and descriptors. Using these annotations, we train FoodSense-VL, a vision-language benchmark model that produces both multisensory ratings and grounded explanations directly from food images. This work connects cognitive science findings on cross-sensory perception with modern instruction tuning for multimodal models and shows that many popular evaluation metrics are insufficient for visual sensory inference.
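
The annotation-expansion step described in the abstract conditions a large language model on the image together with the human ratings and descriptors. The sketch below shows one way such a conditioning prompt might be assembled; the function name and wording are hypothetical, and the resulting prompt would be passed along with the image to whichever multimodal LLM the authors use, which this summary does not name.

    def build_reasoning_prompt(ratings: dict, descriptors: dict) -> str:
        """Assemble a text prompt asking an LLM for an image-grounded
        justification of the human sensory annotations (hypothetical sketch)."""
        lines = [
            "You are shown a photo of food. A participant rated its expected",
            "sensory qualities on a 1-5 scale and gave short descriptors:",
        ]
        for sense in ("taste", "smell", "texture", "sound"):
            lines.append(f"- {sense}: {ratings[sense]}/5, descriptors: {descriptors[sense]}")
        lines.append(
            "Citing only visual evidence in the image (color, surface, shape, "
            "preparation cues), explain why these ratings and descriptors are plausible."
        )
        return "\n".join(lines)


    # The prompt is sent to a multimodal LLM together with the food image;
    # the model's reply becomes the image-grounded reasoning trace.
    prompt = build_reasoning_prompt(
        ratings={"taste": 4, "smell": 3, "texture": 5, "sound": 2},
        descriptors={"taste": "sweet, buttery", "smell": "caramel",
                     "texture": "crispy, flaky", "sound": "soft crunch"},
    )
    print(prompt)

Pairs of (image, expanded reasoning trace, ratings) of this kind are what FoodSense-VL is then instruction-tuned on, so that it emits both the multisensory ratings and a grounded explanation at inference time.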
