Bridging the Long-Tail Gap: Robust Retrieval-Augmented Relation Completion via Multi-Stage Paraphrase Infusion

arXiv cs.CL / 4/27/2026


Key Points

  • The paper addresses relation completion (RC) in cases where the needed information is rare or sparsely expressed, noting that LLMs often struggle even when using retrieval-augmented generation (RAG).
  • It introduces RC-RAG, a multi-stage paraphrase-guided framework that injects relation paraphrases at several points: during retrieval to broaden lexical coverage, in retrieval-based summarization to make summaries relation-aware, and during generation to guide reasoning.
  • RC-RAG improves robustness in long-tail settings without requiring any model fine-tuning, making it easier to adopt across different LLMs.
  • Experiments on two benchmark datasets using five LLMs show consistent gains over multiple RAG baselines, including a reported +40.6 EM improvement for the best LLM in long-tail scenarios.
  • The authors report low computational overhead while achieving these improvements, suggesting the approach can be practically deployed alongside existing RAG pipelines.

Abstract

Large language models (LLMs) struggle with relation completion (RC), both with and without retrieval-augmented generation (RAG), particularly when the required information is rare or sparsely represented. To address this, we propose a novel multi-stage paraphrase-guided relation-completion framework, RC-RAG, that systematically incorporates relation paraphrases across multiple stages. In particular, RC-RAG: (a) integrates paraphrases into retrieval to expand lexical coverage of the relation, (b) uses paraphrases to generate relation-aware summaries, and (c) leverages paraphrases during generation to guide reasoning for relation completion. Importantly, our method does not require any model fine-tuning. Experiments with five LLMs on two benchmark datasets show that RC-RAG consistently outperforms several RAG baselines. In long-tail settings, the best-performing LLM augmented with RC-RAG improves by 40.6 Exact Match (EM) points over its standalone performance and surpasses two strong RAG baselines by 16.0 and 13.8 EM points, respectively, while maintaining low computational overhead.
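To make the three stages concrete, below is a minimal sketch of a paraphrase-infused RAG pipeline in the spirit of RC-RAG. It assumes a generic retriever and an instruction-following LLM passed in as callables; all names (`rc_rag_complete`, `retrieve`, `llm`) and prompt wordings are hypothetical illustrations, not the authors' implementation.

```python
# Minimal sketch of a paraphrase-infused RAG pipeline in the spirit of RC-RAG.
# All function names and prompts are hypothetical placeholders, not the paper's
# actual code. The three stages mirror (a) retrieval, (b) summarization, and
# (c) generation as described in the abstract.

from typing import Callable, List


def rc_rag_complete(
    query: str,                             # e.g. "Marie Curie | place of death | ?"
    relation_paraphrases: List[str],        # e.g. ["died in", "passed away in"]
    retrieve: Callable[[str], List[str]],   # any lexical or dense retriever
    llm: Callable[[str], str],              # any instruction-following LLM
    top_k: int = 5,
) -> str:
    # Stage (a): paraphrase-expanded retrieval to widen lexical coverage.
    # Query once with the original query and once per paraphrase variant.
    passages: List[str] = []
    for variant in [query] + [f"{query} {p}" for p in relation_paraphrases]:
        passages.extend(retrieve(variant))
    # Deduplicate while preserving retrieval order, then truncate.
    passages = list(dict.fromkeys(passages))[:top_k]

    # Stage (b): relation-aware summarization conditioned on the paraphrases.
    summary = llm(
        "Summarize the passages below, keeping only facts relevant to the "
        f"relation expressed by any of: {', '.join(relation_paraphrases)}.\n\n"
        + "\n".join(passages)
    )

    # Stage (c): paraphrase-guided generation of the missing entity.
    return llm(
        f"Relation paraphrases: {', '.join(relation_paraphrases)}\n"
        f"Evidence summary: {summary}\n"
        f"Complete the relation: {query}\n"
        "Answer with the entity only."
    )
```

Note that nothing in this sketch fine-tunes a model: the paraphrases only reshape retrieval queries and prompts, which is consistent with the paper's claim that the method works across different LLMs with low overhead.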