Where and What: Reasoning Dynamic and Implicit Preferences in Situated Conversational Recommendation

arXiv cs.AI / 4/23/2026

Key Points

  • The paper studies situated conversational recommendation (SCR), which uses visual scenes plus dialogue to give contextually relevant recommendations that depend on evolving, implicit user preferences.
  • It proposes SiPeR (Situated Preference Reasoning), a framework built on two mechanisms. The first, scene transition estimation, judges whether the current scene fits the user’s needs and guides the interaction toward a better scene when it does not.
  • SiPeR also uses Bayesian inverse inference that exploits multimodal large language model (MLLM) likelihoods to infer user preferences over candidate items.
  • Experiments on two benchmarks show that SiPeR improves both recommendation accuracy and response generation quality compared with existing approaches.
  • The authors release code and data on GitHub, which supports reproduction and follow-up research.

Abstract

Situated conversational recommendation (SCR), which utilizes visual scenes grounded in specific environments and natural language dialogue to deliver contextually appropriate recommendations, has emerged as a promising research direction due to its close alignment with real-world scenarios. Compared to traditional recommendations, SCR requires a deeper understanding of dynamic and implicit user preferences, as the surrounding scene often influences users' underlying interests, while both may evolve across conversations. This complexity significantly impacts the timing and relevance of recommendations. To address this, we propose situated preference reasoning (SiPeR), a novel framework that integrates two core mechanisms: (1) Scene transition estimation, which estimates whether the current scene satisfies user needs, and guides the user toward a more suitable scene when necessary; and (2) Bayesian inverse inference, which leverages the likelihood of multimodal large language models (MLLMs) to predict user preferences about candidate items within the scene. Extensive experiments on two representative benchmarks demonstrate SiPeR's superiority in both recommendation accuracy and response generation quality. The code and data are available at https://github.com/DongdingLin/SiPeR.
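The Bayesian inverse inference step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes some model has already produced, for each candidate item, a log-likelihood of the observed dialogue given that the user wants that item, and it simply applies Bayes' rule to turn those scores into a posterior over items. The function name `posterior_over_items`, the item names, and the numeric scores are all hypothetical.

```python
import math
from typing import Dict, Optional


def posterior_over_items(
    log_likelihoods: Dict[str, float],
    log_priors: Optional[Dict[str, float]] = None,
) -> Dict[str, float]:
    """Apply Bayes' rule in log space.

    P(item | dialogue) is proportional to P(dialogue | item) * P(item).
    `log_likelihoods` maps each candidate item to log P(dialogue | item);
    in this sketch those values are assumed to come from an MLLM.
    """
    if not log_likelihoods:
        raise ValueError("at least one candidate item is required")
    if log_priors is None:
        # Assumption: with no other information, use a uniform prior.
        log_priors = {item: 0.0 for item in log_likelihoods}

    # Unnormalized log posterior for each item.
    scores = {
        item: ll + log_priors[item] for item, ll in log_likelihoods.items()
    }

    # Subtract the maximum before exponentiating for numerical stability.
    max_score = max(scores.values())
    weights = {item: math.exp(s - max_score) for item, s in scores.items()}
    total = sum(weights.values())
    return {item: w / total for item, w in weights.items()}


# Hypothetical log-likelihoods for three items visible in a scene.
example_scores = {"red_jacket": -12.0, "blue_coat": -14.0, "grey_hoodie": -13.0}
posterior = posterior_over_items(example_scores)
best_item = max(posterior, key=posterior.get)
```

Working in log space matters here because sequence likelihoods from language models are typically very small numbers; normalizing after subtracting the maximum avoids underflow. How SiPeR actually obtains the likelihoods and priors is specified in the paper and repository, not in this sketch.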