SIMMER: Cross-Modal Food Image–Recipe Retrieval via MLLM-Based Embedding

arXiv cs.CL / 4/20/2026


Key Points

  • SIMMER proposes a single unified multimodal embedding model for cross-modal retrieval between food images and recipe texts, aiming to simplify alignment compared with dual-encoder approaches.
  • The method leverages an MLLM-based embedding framework (VLM2Vec) and uses recipe-specific prompt templates (title, ingredients, instructions) to generate effective embeddings.
  • It introduces component-aware data augmentation that trains on both complete and partial recipes to improve robustness when inputs are missing or incomplete.
  • Experiments on Recipe1M show state-of-the-art results: image-to-recipe R@1 improves from 81.8% to 87.5% in the 1k setting and from 56.5% to 65.5% in the 10k setting over the previous best method.
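The recipe-specific prompting and component-aware augmentation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the template wording, the `component_aware_views` helper, and the drop-one-component scheme are all assumptions; SIMMER's exact prompt text and sampling strategy may differ.

```python
def component_aware_views(recipe: dict) -> list[str]:
    """Build a prompt for the full recipe plus partial variants.

    A recipe is structured as title / ingredients / instructions. Besides the
    complete view, we emit one partial view per dropped component, mimicking
    training on incomplete inputs (an assumed scheme, for illustration only).
    """
    components = ["title", "ingredients", "instructions"]
    # complete view first, then one view with each single component removed
    keeps = [components] + [[c for c in components if c != d] for d in components]
    texts = []
    for keep in keeps:
        parts = [f"{c.capitalize()}: {recipe[c]}" for c in keep]
        # hypothetical instruction prefix preceding the structured fields
        texts.append("Represent this recipe for retrieval.\n" + "\n".join(parts))
    return texts
```

Each returned string would then be embedded by the MLLM encoder, so the model sees both complete and partial recipes during training.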

Abstract

Cross-modal retrieval between food images and recipe texts is an important task with applications in nutritional management, dietary logging, and cooking assistance. Existing methods predominantly rely on dual-encoder architectures with separate image and text encoders, requiring complex alignment strategies and task-specific network designs to bridge the semantic gap between modalities. In this work, we propose SIMMER (Single Integrated Multimodal Model for Embedding Recipes), which applies Multimodal Large Language Model (MLLM)-based embedding models, specifically VLM2Vec, to this task, replacing the conventional dual-encoder paradigm with a single unified encoder that processes both food images and recipe texts. We design prompt templates tailored to the structured nature of recipes, which consist of a title, ingredients, and cooking instructions, enabling effective embedding generation by the MLLM. We further introduce a component-aware data augmentation strategy that trains the model on both complete and partial recipes, improving robustness to incomplete inputs. Experiments on the Recipe1M dataset demonstrate that SIMMER achieves state-of-the-art performance across both the 1k and 10k evaluation settings, substantially outperforming all prior methods. In particular, our best model improves the 1k image-to-recipe R@1 from 81.8% to 87.5% and the 10k image-to-recipe R@1 from 56.5% to 65.5% compared to the previous best method.
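The R@1 numbers quoted above come from ranking candidate recipe embeddings against each food-image embedding over a 1k or 10k pool. A minimal sketch of that metric, assuming cosine similarity over L2-normalized embeddings and that query *i* is paired with candidate *i* (the standard Recipe1M protocol; the paper's exact evaluation code is not reproduced here):

```python
import numpy as np

def recall_at_k(query_emb: np.ndarray, cand_emb: np.ndarray, k: int = 1) -> float:
    """Fraction of queries whose paired candidate ranks in the top k.

    query_emb, cand_emb: (N, D) arrays where row i of each is a matched pair,
    e.g. image embeddings vs. recipe-text embeddings from the same encoder.
    """
    # L2-normalize so the dot product equals cosine similarity
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    c = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    sims = q @ c.T                          # (N, N) similarity matrix
    topk = np.argsort(-sims, axis=1)[:, :k] # indices of the k best candidates
    gold = np.arange(len(q))[:, None]       # the paired index for each query
    return float((topk == gold).any(axis=1).mean())
```

Evaluating over a 1k pool versus a 10k pool only changes N; larger pools make the ranking harder, which is why the 10k R@1 figures are lower than the 1k ones.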