Image Score: Learning and Evaluating Human Preferences for Mercari Search

arXiv cs.CV / 5/4/2026

Key Points

Mercari tackles image quality assessment for its C2C search experience, where relating the abundant implicit feedback from search to human judgments of image quality is not straightforward.
The company proposes a cost-efficient weak-supervision approach in which an LLM, guided by chain-of-thought prompting, generates image aesthetics labels that correlate well with e-commerce user behavior.
Using LLM-produced labels improves the explainability of deep image quality evaluation, which supports customer journey optimization on Mercari.
Experiments show that the LLM-derived labels correlate with user behavior, and an online experiment produced significant sales growth on Mercari’s web platform.
The approach is positioned as convenient for proof-of-concept testing because it reduces reliance on expensive explicit human judgments.
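The key points do not include Mercari's actual prompt. As a minimal sketch of what chain-of-thought labeling can look like, the snippet below assumes a hypothetical template: the wording, the 1-to-5 scale, and the `Score: <n>` answer format are all illustrative assumptions. It shows how the final score can be separated from the reasoning that precedes it, so the reasoning can be kept as an explanation for the label. The call to a multimodal LLM with the listing photo attached is omitted.

```python
import re
from typing import Optional, Tuple

# Hypothetical CoT prompt: wording, scale, and answer format are assumptions
# for illustration, not the prompt used by Mercari.
COT_PROMPT_TEMPLATE = (
    "You are rating a product photo from a second-hand marketplace listing.\n"
    "Item title: {title}\n"
    "Think step by step about lighting, focus, background clutter, framing,\n"
    "and whether the item is clearly visible. Write your reasoning first,\n"
    "then end with a final line of the form 'Score: <n>' where <n> is an\n"
    "integer from 1 (poor) to 5 (excellent)."
)

# A line consisting only of "Score: n" with n in 1..5 (surrounding spaces allowed).
SCORE_LINE = re.compile(r"^[ \t]*Score:[ \t]*([1-5])[ \t]*$", re.MULTILINE)


def build_prompt(title: str) -> str:
    """Fill the hypothetical CoT template for one listing."""
    return COT_PROMPT_TEMPLATE.format(title=title)


def parse_label(response: str) -> Optional[Tuple[int, str]]:
    """Split an LLM response into (score, reasoning).

    Uses the last valid 'Score: n' line, so a score the model mentions and
    then revises is ignored. Returns None when no valid score line exists,
    so malformed outputs can be dropped from the weak-label set.
    """
    matches = list(SCORE_LINE.finditer(response))
    if not matches:
        return None
    last = matches[-1]
    reasoning = response[: last.start()].strip()
    return int(last.group(1)), reasoning
```

Keeping the reasoning text next to each score is one plausible way to obtain the explainability the key points mention; responses that parse to `None` would simply be discarded rather than guessed.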

Abstract

Mercari is the largest C2C e-commerce marketplace in Japan, with more than 20 million monthly active users. Because search is the fundamental way users discover desired items, we have always had a substantial amount of implicit feedback data. Although we actively take advantage of it to provide the best service for our users, relating implicit feedback to tasks such as image quality assessment is not trivial. Many traditional lines of research in Machine Learning (ML) are similarly motivated by the insatiable appetite of Deep Learning (DL) models for well-labelled training data. Weak supervision is about leveraging higher-level and/or noisier supervision over unlabeled data, and Large Language Models (LLMs) are being actively studied and used for data labelling tasks. We present how we leverage Chain-of-Thought (CoT) prompting to enable an LLM to produce image aesthetics labels that correlate well with human behavior in e-commerce settings. Leveraging LLMs is more cost-effective than explicit human judgment, while significantly improving the explainability of deep image quality evaluation, which is highly important for customer journey optimization at Mercari. We propose a cost-efficient LLM-driven approach for assessing and predicting image quality in e-commerce settings, which is very convenient for proof-of-concept testing. We show that our LLM-produced labels correlate with user behavior on Mercari. Finally, we show results from an online experiment in which we achieved significant sales growth on the web platform.
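The abstract does not state which correlation measure or which behavior signal was used. A minimal sketch, assuming Spearman's rank correlation between per-listing LLM scores and a hypothetical behavior signal such as click-through rate, could look like the following. Rank correlation is a natural assumption here because LLM scores are ordinal, but it is a swapped-in choice, not the paper's confirmed method. The toy numbers in the usage line are made up for illustration only.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j hold ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: the Pearson correlation of the average ranks of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)


# Toy, made-up data: LLM scores (1-5) and a hypothetical click-through rate.
llm_scores = [1, 3, 3, 5]
click_through = [0.01, 0.02, 0.03, 0.05]
rho = spearman(llm_scores, click_through)
```

In practice a library routine such as `scipy.stats.spearmanr` would do the same job; the hand-written version is only meant to make the tie handling explicit.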