Deep Interest Mining with Cross-Modal Alignment for SemanticID Generation in Generative Recommendation

arXiv cs.AI / 4/25/2026


Key Points

  • The paper addresses limitations of Generative Recommendation’s Semantic ID (SID) generation, including semantic information loss, semantic degradation from cascaded quantization, and text–image modality misalignment.
  • It proposes a framework combining Deep Contextual Interest Mining (DCIM), Cross-Modal Semantic Alignment (CMSA), and a Quality-Aware Reinforcement Mechanism (QARM) to produce higher-quality, context-preserving SIDs.
  • CMSA uses Vision-Language Models (VLMs) to map non-text modalities into a unified text-based semantic space, mitigating the modality distortion that quantizers introduce even after upstream networks have aligned the features.
  • DCIM mines high-level interest/context from advertising-related signals using reconstruction-based supervision, while QARM applies reinforcement learning with quality-aware rewards to improve posterior-stage SID selection.
  • Experiments and ablation studies show consistent gains over state-of-the-art SID generation methods across multiple benchmarks, with each component contributing to the overall improvement.
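To make the "cascaded quantization" the paper critiques concrete, here is a minimal residual-quantization sketch (the common scheme behind SID generation, e.g. RQ-VAE-style tokenizers). All names and shapes are illustrative, not from the paper: each level quantizes the residual left by the previous one, and the chain of code indices is the item's Semantic ID. The final residual is exactly the information the quantizer discards, which is the semantic-degradation problem the framework targets.

```python
import numpy as np

def residual_quantize(x, codebooks):
    """Cascaded (residual) quantization: each level snaps the current
    residual to its nearest codeword, and the concatenated code indices
    form the Semantic ID (SID). The leftover residual is the semantic
    information lost by the cascade."""
    residual = x.copy()
    sid = []
    for codebook in codebooks:          # codebook: (K, d) array per level
        dists = np.linalg.norm(codebook - residual, axis=1)
        idx = int(np.argmin(dists))     # nearest codeword at this level
        sid.append(idx)
        residual = residual - codebook[idx]
    return sid, residual

# Toy setup: 3 quantization levels, 16 codewords each, 8-dim embeddings.
rng = np.random.default_rng(0)
d, K, levels = 8, 16, 3
codebooks = [rng.normal(size=(K, d)) for _ in range(levels)]
item_embedding = rng.normal(size=d)
sid, lost = residual_quantize(item_embedding, codebooks)
```

Because each level is fit greedily against the previous level's residual rather than jointly against a unified objective, errors compound down the cascade; that is the misalignment between embedding generation and quantization that the paper's joint framing addresses.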

Abstract

Generative Recommendation (GR) has demonstrated remarkable performance in the next-token prediction paradigm, which relies on Semantic IDs (SIDs) to compress trillion-scale data into learnable vocabulary sequences. However, existing methods suffer from three critical limitations: (1) Information Degradation: the two-stage compression pipeline causes semantic loss and information degradation, with no posterior mechanism to distinguish high-quality from low-quality SIDs; (2) Semantic Degradation: cascaded quantization discards key semantic information from the original multimodal features, as the embedding generation and quantization stages are not jointly optimized toward a unified objective; (3) Modality Distortion: quantizers fail to properly align text and image modalities, causing feature misalignment even when upstream networks have aligned them. To address these challenges, we propose a novel framework integrating three key innovations: Deep Contextual Interest Mining (DCIM), Cross-Modal Semantic Alignment (CMSA), and a Quality-Aware Reinforcement Mechanism (QARM). First, we leverage Vision-Language Models (VLMs) to align non-textual modalities into a unified text-based semantic space, mitigating modality distortion. Second, we introduce a deep interest mining mechanism that captures high-level semantic information implicitly present in advertising contexts, encouraging SIDs to preserve critical contextual information through reconstruction-based supervision. Third, we employ a reinforcement learning framework with quality-aware rewards to encourage semantically rich SIDs while suppressing low-quality ones in the posterior stage. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art SID generation methods, achieving superior performance on multiple benchmarks. Ablation studies further validate the effectiveness of each proposed component.
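The abstract does not specify QARM's learning rule, but the "reinforcement learning with quality-aware rewards" idea can be sketched with a plain REINFORCE update over candidate SIDs. Everything here is hypothetical: the quality scores, the candidate set, and the policy parameterization are toy stand-ins, not the paper's method. The point is the mechanism: sampling under the current policy and weighting the log-probability gradient by a quality reward (with a mean baseline) shifts probability mass toward high-quality SIDs and suppresses low-quality ones.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # numerical stability
    e = np.exp(z)
    return e / e.sum()

def reinforce_step(logits, quality, lr=0.1, rng=None):
    """One REINFORCE update: sample a candidate SID, score it with a
    quality-aware reward, and nudge the policy toward candidates whose
    quality beats the baseline (here, the mean quality)."""
    rng = rng or np.random.default_rng()
    probs = softmax(logits)
    a = rng.choice(len(probs), p=probs)
    advantage = quality[a] - quality.mean()   # quality-aware reward - baseline
    grad = -probs
    grad[a] += 1.0                            # d log p(a) / d logits
    return logits + lr * advantage * grad

# Hypothetical quality scores for 4 candidate SIDs; candidate 1 is best.
quality = np.array([0.1, 0.9, 0.2, 0.3])
logits = np.zeros(4)                          # uniform initial policy
rng = np.random.default_rng(0)
for _ in range(500):
    logits = reinforce_step(logits, quality, rng=rng)
best = int(np.argmax(softmax(logits)))        # policy's preferred SID
```

In the paper's setting the reward would come from a learned quality signal applied in the posterior stage, and the policy would operate over generated SID sequences rather than a fixed candidate list; this sketch only shows the reward-weighted update itself.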