EPM-RL: Reinforcement Learning for On-Premise Product Mapping in E-Commerce

arXiv cs.CL / April 28, 2026


Key Points

  • The paper proposes EPM-RL, a reinforcement-learning framework to perform on-premise e-commerce product mapping (matching listings that refer to the same product) despite noisy seller-provided text like promotional keywords and bundles.
  • EPM-RL reduces reliance on costly external agentic LLM pipelines by distilling their high-cost reasoning into a small in-house student model via parameter-efficient fine-tuning (PEFT) on structured reasoning outputs.
  • It then applies reinforcement learning with an agent-based reward that simultaneously checks output-format compliance, correct matching labels, and reasoning preferences scored by purpose-built judge models.
  • Preliminary results indicate EPM-RL improves consistently over PEFT-only training and achieves a better quality–cost trade-off than commercial API-based baselines, while enabling privacy-preserving private deployment.
  • The approach aims to transform product mapping from a high-latency, hard-to-operate agentic pipeline into a scalable, inspectable, production-ready in-house system.
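The agent-based reward described above can be sketched as a weighted combination of three checks. This is a minimal illustration, not the paper's exact formulation: the field names (`reasoning`, `match`), the JSON output format, and the weights are assumptions, and `judge_score` stands in for the output of the paper's purpose-built judge models.

```python
import json

def format_score(output: str) -> float:
    """1.0 if the model emitted valid JSON with the expected keys, else 0.0."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    return 1.0 if isinstance(parsed, dict) and {"reasoning", "match"} <= parsed.keys() else 0.0

def label_score(output: str, gold_match: bool) -> float:
    """1.0 if the predicted match label agrees with the gold label."""
    try:
        return 1.0 if json.loads(output).get("match") == gold_match else 0.0
    except (json.JSONDecodeError, AttributeError):
        return 0.0

def reward(output: str, gold_match: bool, judge_score: float,
           w_fmt: float = 0.2, w_lbl: float = 0.5, w_judge: float = 0.3) -> float:
    """Composite reward: format compliance + label correctness + judge preference.
    Weights are illustrative assumptions; judge_score in [0, 1] would come from
    a trained judge model scoring the reasoning trace."""
    return (w_fmt * format_score(output)
            + w_lbl * label_score(output, gold_match)
            + w_judge * judge_score)

out = '{"reasoning": "Same SKU after stripping bundle text.", "match": true}'
print(round(reward(out, gold_match=True, judge_score=0.8), 3))  # 0.94
```

A malformed or mislabeled output loses the corresponding reward terms, which is what lets RL jointly optimize all three aspects at once rather than training separate heads.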

Abstract

Product mapping, the task of deciding whether two e-commerce listings refer to the same product, is a core problem for price monitoring and channel visibility. In real marketplaces, however, sellers frequently inject promotional keywords, platform-specific tags, and bundle descriptions into titles, causing the same product to appear under many different names. Recent LLM-based and multi-agent frameworks improve robustness and interpretability on such hard cases, but they often rely on expensive external APIs, repeated retrieval, and complex inference-time orchestration, making large-scale deployment costly and difficult in privacy-sensitive enterprise settings. To address these issues, we present EPM-RL, a reinforcement-learning-based framework for building an accurate and efficient on-premise e-commerce product mapping model. Our central idea is to distill high-cost agentic reasoning into a trainable in-house model. Starting from a curated set of product pairs with LLM-generated rationales and human verification, we first perform parameter-efficient fine-tuning (PEFT) on a small student model using structured reasoning outputs. We then further optimize the model with reinforcement learning (RL) using an agent-based reward that jointly evaluates output-format compliance, label correctness, and reasoning-preference scores from specially designed judge models. Preliminary results show that EPM-RL consistently improves over PEFT-only training and offers a stronger quality–cost trade-off than commercial API-based baselines, while enabling private deployment and lower operational cost. These findings suggest that reinforcement learning can turn product mapping from a high-latency agentic pipeline into a scalable, inspectable, and production-ready in-house system.
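The distillation stage starts from product pairs with LLM-generated, human-verified rationales, serialized as structured reasoning outputs for the student to imitate. A minimal sketch of what one such training record might look like follows; the field names, titles, and JSON target format here are illustrative assumptions, not the paper's actual schema.

```python
import json

# Hypothetical curated example: two noisy listings for the same product,
# an LLM-generated rationale (human-verified), and the gold match label.
record = {
    "title_a": "Apple AirPods Pro 2 [FREE CASE] HOT SALE!!",
    "title_b": "AirPods Pro (2nd Generation) + silicone cover bundle",
    "rationale": "Both titles name AirPods Pro 2nd generation; the remaining "
                 "tokens are promotional keywords and bundle add-ons.",
    "match": True,
}

# The structured reasoning output the student model is fine-tuned to emit:
# a machine-checkable rationale plus the final matching decision.
target = json.dumps({"reasoning": record["rationale"], "match": record["match"]})
print(target)
```

Because the target is structured rather than free-form, the later RL stage can mechanically verify format compliance and the match label, while judge models score only the `reasoning` field.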