FashionStylist: An Expert Knowledge-enhanced Multimodal Dataset for Fashion Understanding

arXiv cs.CV · April 13, 2026

📰 News · Models & Research

Key Points

  • The paper introduces FashionStylist, an expert-annotated multimodal benchmark aimed at holistic fashion understanding that combines visual perception with style and rationale reasoning.
  • The dataset is built via a dedicated fashion-expert annotation pipeline and includes professionally grounded labels at both item and full-outfit levels.
  • FashionStylist supports three tasks—outfit-to-item grounding, outfit completion, and outfit evaluation—covering complex item recovery (layering/accessories), compatibility-aware composition (beyond co-occurrence), and expert scoring of style/season/occasion/coherence.
  • Experiments show that FashionStylist serves both as a unified evaluation benchmark and, when used as training data, improves MLLM-based fashion systems on grounding, completion, and outfit-level semantic evaluation.
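To make the three tasks concrete, the sketch below models one plausible record shape per task. The paper does not publish its schema here, so all field names, score axes, and the 1-5 rating scale are illustrative assumptions, not the dataset's actual format:

```python
# Hypothetical record shapes for the three FashionStylist tasks.
# Field names and score ranges are assumptions for illustration only.
from dataclasses import dataclass


@dataclass
class GroundingExample:
    """Outfit-to-item grounding: locate each item in the outfit image."""
    outfit_image: str
    item_boxes: dict  # item name -> (x, y, w, h) bounding box, assumed layout


@dataclass
class CompletionExample:
    """Outfit completion: pick the compatible missing item from candidates."""
    partial_outfit: list  # image paths of the items already in the outfit
    candidates: list      # image paths of candidate items
    answer_index: int     # index of the expert-chosen compatible item


@dataclass
class EvaluationExample:
    """Outfit evaluation: expert scores along semantic axes."""
    outfit_image: str
    scores: dict          # axis ("style", "season", ...) -> assumed 1-5 rating
    rationale: str        # expert's free-text justification


sample = EvaluationExample(
    outfit_image="outfit_001.jpg",
    scores={"style": 4, "season": 5, "occasion": 4, "coherence": 4},
    rationale="Layered neutrals suit a cool-weather casual setting.",
)
mean_score = sum(sample.scores.values()) / len(sample.scores)
print(mean_score)
```

A unified loader over records like these would let one model train on all three tasks, which is the multi-task use the paper reports.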

Abstract

Fashion understanding requires both visual perception and expert-level reasoning about style, occasion, compatibility, and outfit rationale. However, existing fashion datasets remain fragmented and task-specific, often focusing on item attributes, outfit co-occurrence, or weak textual supervision, and thus provide limited support for holistic outfit understanding. In this paper, we introduce FashionStylist, an expert-annotated benchmark for holistic and expert-level fashion understanding. Constructed through a dedicated fashion-expert annotation pipeline, FashionStylist provides professionally grounded annotations at both the item and outfit levels. It supports three representative tasks: outfit-to-item grounding, outfit completion, and outfit evaluation. These tasks cover realistic item recovery from complex outfits with layering and accessories, compatibility-aware composition beyond co-occurrence matching, and expert-level assessment of style, season, occasion, and overall coherence. Experimental results show that FashionStylist serves not only as a unified benchmark for multiple fashion tasks, but also as an effective training resource for improving grounding, completion, and outfit-level semantic evaluation in MLLM-based fashion systems.