From Unstructured to Structured: LLM-Guided Attribute Graphs for Entity Search and Ranking

arXiv cs.CL / 5/1/2026

Key Points

The paper proposes an LLM-guided method to build structured attribute graphs for entity search in e-commerce, addressing limitations of embedding-only approaches in capturing context-specific relevance.
It uses a two-stage pipeline: an offline phase that extracts product attributes from unstructured text and builds reusable, category-aware graph schemas, and an online phase that ranks candidates via graph-aware LLM reasoning.
The approach ranks candidates using structured representations instead of raw text, cutting per-product token usage by 57% while improving ranking precision.
In zero-shot experiments, the method outperforms several baselines and achieves over a 5% improvement in average precision without requiring training data, while generalizing across diverse product categories.
The authors conclude the technique has strong potential for real-world deployment due to its efficiency and robustness.
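
As a rough illustration of the two-stage idea described in the points above, the following Python sketch builds a small attribute graph offline and ranks candidates online from the structured form. It is not the authors' implementation: the schema contents, the `key: value` parser standing in for LLM extraction, and the attribute-overlap score standing in for graph-aware LLM reasoning are all assumptions made for the example, as are every function and field name.

```python
from dataclasses import dataclass, field

# Hypothetical category-aware schemas: the attributes assumed to matter per category.
CATEGORY_SCHEMAS = {
    "laptop": ["brand", "screen_size", "ram"],
    "shoe": ["brand", "size", "material"],
}


@dataclass
class ProductNode:
    """One node of the (illustrative) attribute graph."""
    product_id: str
    category: str
    attributes: dict = field(default_factory=dict)


def extract_attributes(text, category):
    """Stand-in for LLM extraction: parse 'key: value' fragments separated by ';'
    and keep only the keys listed in the category schema."""
    schema = CATEGORY_SCHEMAS[category]
    found = {}
    for part in text.split(";"):
        if ":" not in part:
            continue  # free-text fragment with no attribute, dropped
        key, value = part.split(":", 1)
        key = key.strip().lower().replace(" ", "_")
        if key in schema:
            found[key] = value.strip()
    return found


def build_graph(raw_products):
    """Offline stage: turn (id, category, text) records into reusable nodes."""
    return {
        pid: ProductNode(pid, category, extract_attributes(text, category))
        for pid, category, text in raw_products
    }


def serialize(node):
    """Compact structured form that would be passed to a ranker instead of raw text."""
    schema = CATEGORY_SCHEMAS[node.category]
    return "; ".join(f"{k}={node.attributes[k]}" for k in schema if k in node.attributes)


def rank(graph, query_id, candidate_ids):
    """Online stage stand-in: order candidates by the number of attribute values
    they share with the query (the paper uses LLM reasoning at this step)."""
    query = graph[query_id]

    def score(candidate_id):
        candidate = graph[candidate_id]
        return sum(
            1 for k, v in query.attributes.items() if candidate.attributes.get(k) == v
        )

    return sorted(candidate_ids, key=score, reverse=True)
```

Because `serialize` emits only schema attributes, marketing copy and other free text never reach the ranker, which is the intuition behind the reported reduction in per-product tokens. The actual savings depend on the extraction and prompt format used in the paper.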

Abstract

Entity search, i.e., finding the most similar entities to a query entity, faces unique challenges in e-commerce, where product similarity varies across categories and contexts. Traditional embedding-based approaches often struggle to capture nuanced context-specific attribute relevance. In this paper, we present a two-stage approach combining Large Language Model (LLM)-driven attribute graph construction with graph-aware LLM ranking. In the offline stage, we extract structured product attributes from unstructured text and construct a reusable attribute graph with category-aware schemas. In the online stage, we rank retrieved candidates by reasoning over this structured representation rather than raw text, reducing per-product token usage by 57% while improving ranking precision. Experiments show that our approach outperforms multiple baselines under zero-shot scenarios, achieving an improvement of over 5% in average precision without requiring training data, generalizes robustly across diverse product categories, and shows immense potential for real-world deployment.
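
For readers unfamiliar with the metric named in the abstract, here is a minimal sketch of average precision for a single ranked list with binary relevance labels. The function name is hypothetical, and this is the common textbook variant that normalizes by the number of relevant items found in the list; the paper may use a different cutoff or normalization.

```python
def average_precision(ranked_relevance):
    """Average precision for one ranked list.

    ranked_relevance: sequence of truthy/falsy labels, best-ranked item first.
    Averages precision@k over the positions k that hold a relevant item.
    Returns 0.0 when the list contains no relevant item.
    """
    hits = 0
    precision_sum = 0.0
    for position, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / position  # precision at this position
    return precision_sum / hits if hits else 0.0
```

For example, a list whose first and third items are relevant scores (1/1 + 2/3) / 2, so moving the second relevant item up to position two would raise the score to 1.0.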