FoodSense: A Multisensory Food Dataset and Benchmark for Predicting Taste, Smell, Texture, and Sound from Images
arXiv cs.CV / April 17, 2026
Key Points
- The paper introduces FoodSense, a human-annotated multisensory food dataset for predicting taste, smell, texture, and sound from images, rather than supporting only recognition tasks.
- FoodSense covers 2,987 unique food images with 66,842 participant-image pairs, providing 1–5 numeric ratings plus free-text descriptors for each of the four sensory dimensions (a minimal record schema is sketched after this list).
- It also adds image-grounded reasoning traces: a large language model generates visual justifications conditioned on the image and the sensory annotations, enabling both prediction and explanation (see the prompt sketch below).
- The authors train FoodSense-VL, a vision-language benchmark model that outputs multisensory ratings and grounded explanations directly from food images.
- The work argues that common evaluation metrics often fail to capture how well models infer multisensory experiences from images, and positions the approach as a bridge between cognitive science and multimodal instruction tuning (a simple per-dimension evaluation is sketched below).
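
To make the dataset description concrete, here is a minimal sketch of what one FoodSense annotation record might look like. The field names (`image_path`, `participant_id`, `reasoning_trace`, etc.) are assumptions for illustration; the paper's actual file layout and schema are not given in this summary.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SensoryAnnotation:
    rating: int             # 1-5 numeric rating from one participant
    descriptors: List[str]  # free-text descriptors for this dimension


@dataclass
class FoodSenseRecord:
    # Hypothetical per-(participant, image) record; field names are illustrative.
    image_path: str
    participant_id: str
    # One annotation per sensory dimension covered by the dataset.
    taste: SensoryAnnotation
    smell: SensoryAnnotation
    texture: SensoryAnnotation
    sound: SensoryAnnotation
    # Optional LLM-generated, image-grounded reasoning trace.
    reasoning_trace: str = ""
```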
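The reasoning traces are described as LLM-generated justifications conditioned on the image and the human sensory annotations. A hedged sketch of how such a prompt could be assembled is shown below; the wording and the `build_reasoning_prompt` helper are hypothetical, not the paper's actual prompting setup.

```python
def build_reasoning_prompt(record: FoodSenseRecord) -> str:
    """Assemble a prompt asking an LLM to justify the human sensory ratings
    using only visual evidence. Wording is illustrative, not the paper's."""
    lines = [
        "You are given a food image and human sensory ratings on a 1-5 scale.",
        "Explain, citing only cues visible in the image, why each rating is plausible.",
    ]
    for dim in ("taste", "smell", "texture", "sound"):
        ann: SensoryAnnotation = getattr(record, dim)
        lines.append(f"{dim}: rating={ann.rating}, descriptors={', '.join(ann.descriptors)}")
    return "\n".join(lines)
```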
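For the evaluation point, a common baseline is to score predicted ratings against human ratings per sensory dimension, e.g. with mean absolute error and rank correlation; the summary's caveat is that such metrics may miss much of the multisensory experience. The snippet below is a generic sketch of that kind of per-dimension scoring, not the paper's protocol.

```python
import numpy as np
from scipy.stats import spearmanr


def evaluate_dimension(pred: np.ndarray, human: np.ndarray) -> dict:
    """Mean absolute error and rank correlation for one sensory dimension."""
    mae = float(np.mean(np.abs(pred - human)))
    rho, _ = spearmanr(pred, human)
    return {"mae": mae, "spearman_rho": float(rho)}


# Example: predicted vs. averaged human taste ratings for a few images (toy numbers).
preds = np.array([3.2, 4.1, 2.0, 4.8])
human = np.array([3.0, 4.5, 2.5, 5.0])
print(evaluate_dimension(preds, human))
```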