EHRAG: Bridging Semantic Gaps in Lightweight GraphRAG via Hybrid Hypergraph Construction and Retrieval

arXiv cs.AI / 4/21/2026


Key Points

  • The paper introduces EHRAG, a lightweight GraphRAG framework designed to bridge semantic gaps that occur when entities are only connected via structural co-occurrence.
  • EHRAG builds a hypergraph using both structural hyperedges (from sentence-level entity co-occurrence via lightweight extraction) and semantic hyperedges (from clustering entity text embeddings), capturing latent relationships.
  • For retrieval, it uses a hybrid structure–semantic diffusion approach with topic-aware scoring and personalized PageRank (PPR) refinement to select the top-k relevant documents.
  • Experiments across four datasets show EHRAG outperforms existing baselines while keeping linear indexing complexity and requiring zero tokens for hypergraph construction.
  • The work is provided as an open-source implementation on GitHub, enabling replication and further research.
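The two kinds of hyperedges described above can be sketched in a few lines. Everything here is our illustrative assumption, not the paper's implementation: the capitalization-based entity extractor stands in for whatever lightweight NER EHRAG uses, the toy 2-d vectors stand in for real text embeddings, and the greedy similarity grouping stands in for the paper's clustering step.

```python
import math

def extract_entities(sentence):
    # Stub for lightweight entity extraction: treat capitalized tokens
    # as entities. (Placeholder for a real NER model.)
    return {tok.strip(".,") for tok in sentence.split() if tok[0].isupper()}

def structural_hyperedges(sentences):
    # One structural hyperedge per sentence: the set of entities
    # that co-occur in it (kept only when at least two co-occur).
    edges = []
    for s in sentences:
        ents = extract_entities(s)
        if len(ents) >= 2:
            edges.append(ents)
    return edges

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def semantic_hyperedges(embeddings, threshold=0.8):
    # Greedy single-link grouping of entity embeddings by cosine
    # similarity -- a crude stand-in for the paper's clustering.
    groups = []
    for ent, vec in embeddings.items():
        for g in groups:
            if any(cosine(vec, embeddings[m]) >= threshold for m in g):
                g.add(ent)
                break
        else:
            groups.append({ent})
    # A semantic hyperedge links entities that never co-occur
    # structurally but sit close in embedding space.
    return [g for g in groups if len(g) >= 2]

sentences = ["Paris is the capital of France.", "Tokyo hosts the Olympics."]
embeddings = {"Paris": [1.0, 0.0], "France": [0.9, 0.1], "Tokyo": [0.0, 1.0]}
print(structural_hyperedges(sentences))  # two sentence-level hyperedges
print(semantic_hyperedges(embeddings))   # Paris and France grouped semantically
```

Note that neither step calls an LLM, which is how the construction stays at zero token consumption with linear indexing cost.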

Abstract

Graph-based Retrieval-Augmented Generation (GraphRAG) enhances LLMs by structuring the corpus into graphs to facilitate multi-hop reasoning. While recent lightweight approaches reduce indexing costs by leveraging Named Entity Recognition (NER), they rely strictly on structural co-occurrence, failing to capture latent semantic connections between disjoint entities. To address this, we propose EHRAG, a lightweight RAG framework that constructs a hypergraph capturing both structure-level and semantic-level relationships, employing a hybrid structural-semantic retrieval mechanism. Specifically, EHRAG constructs structural hyperedges based on sentence-level co-occurrence with lightweight entity extraction and semantic hyperedges by clustering entity text embeddings, ensuring the hypergraph encompasses both structural and semantic information. For retrieval, EHRAG performs a structure-semantic hybrid diffusion with topic-aware scoring and Personalized PageRank (PPR) refinement to identify the top-k relevant documents. Experiments on four datasets show that EHRAG outperforms state-of-the-art baselines while maintaining linear indexing complexity and zero token consumption for construction. Code is available at https://github.com/yfsong00/EHRAG.
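The PPR refinement step in the retrieval pipeline can be illustrated with a standard power-iteration Personalized PageRank over an entity graph, seeded by query entities, with documents then ranked by the scores of the entities they mention. This is a generic sketch of the PPR technique only; the function names, the restart weight `alpha`, and the sum-of-entity-scores document ranking are our assumptions, and it omits EHRAG's hybrid diffusion and topic-aware scoring entirely.

```python
def personalized_pagerank(neighbors, seeds, alpha=0.85, iters=50):
    # neighbors: entity -> list of adjacent entities (derived from hyperedges).
    # seeds: query entities; the walk restarts at them with probability 1-alpha.
    nodes = list(neighbors)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    score = dict(restart)
    for _ in range(iters):
        new = {n: (1 - alpha) * restart[n] for n in nodes}
        for n in nodes:
            outs = neighbors[n]
            if not outs:
                continue
            share = alpha * score[n] / len(outs)
            for m in outs:
                new[m] += share  # spread mass to neighbors
        score = new
    return score

def rank_docs(doc_entities, ppr, k=2):
    # Hypothetical top-k selection: score a document by summing the
    # PPR scores of the entities it mentions.
    ranked = sorted(doc_entities,
                    key=lambda d: -sum(ppr.get(e, 0.0) for e in doc_entities[d]))
    return ranked[:k]

graph = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
scores = personalized_pagerank(graph, seeds={"A"})
print(rank_docs({"doc1": ["A", "B"], "doc2": ["C"]}, scores, k=1))
```

Seeding the restart distribution at the query entities is what makes the walk "personalized": mass concentrates around the query neighborhood, so entities (and hence documents) reachable through dense hyperedge connections score higher than distant ones.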