VLM4Rec: Multimodal Semantic Representation for Recommendation with Large Vision-Language Models
arXiv cs.AI / 3/16/2026
Key Points
- VLM4Rec reframes multimodal recommendation from simple feature fusion to semantic alignment by grounding each item image into an explicit natural-language description using a large vision-language model.
- It then encodes these grounded semantics into dense item representations and scores candidates with a profile-based semantic matching mechanism, enabling an offline-online decomposition (a minimal sketch follows this list).
- Experiments on multiple multimodal datasets indicate that VLM4Rec consistently improves over raw visual features and fusion-based baselines, suggesting that representation quality matters more than fusion complexity.
- The authors release the code at https://github.com/tyvalencia/enhancing-mm-rec-sys to facilitate replication and practical use.
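
Below is a minimal sketch of the grounding-then-matching pipeline described above. It assumes a BLIP captioning model as a stand-in for the paper's vision-language model, a MiniLM sentence encoder for the dense item representations, and a mean-pooled interaction history as the user profile; none of these specific choices are confirmed by the paper.

```python
# Hedged sketch of a VLM4Rec-style pipeline: ground item images into text,
# embed the text, then match against a user profile. Model choices here
# (BLIP, MiniLM) are illustrative assumptions, not the paper's exact setup.
import numpy as np
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
from sentence_transformers import SentenceTransformer

# --- Offline stage: ground each item image into text, then embed it. ---
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
)
text_encoder = SentenceTransformer("all-MiniLM-L6-v2")

def ground_image(path: str) -> str:
    """Generate a natural-language description for one item image."""
    inputs = processor(images=Image.open(path).convert("RGB"), return_tensors="pt")
    out = captioner.generate(**inputs, max_new_tokens=40)
    return processor.decode(out[0], skip_special_tokens=True)

def embed_items(image_paths: list[str]) -> np.ndarray:
    """Caption every item, then encode the captions into L2-normalized vectors."""
    captions = [ground_image(p) for p in image_paths]
    return text_encoder.encode(captions, normalize_embeddings=True)

# --- Online stage: profile-based semantic matching. ---
def recommend(item_vecs: np.ndarray, history_idx: list[int], k: int = 10) -> list[int]:
    """Score candidates by cosine similarity to a mean-pooled user profile."""
    profile = item_vecs[history_idx].mean(axis=0)
    profile /= np.linalg.norm(profile) + 1e-12
    scores = item_vecs @ profile          # item vectors are already normalized
    scores[history_idx] = -np.inf         # do not re-recommend seen items
    return np.argsort(-scores)[:k].tolist()
```

In this reading, the captioning and embedding steps run once offline, while online recommendation reduces to a dot product between precomputed item vectors and the user's profile vector.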
