Association Is Not Similarity: Learning Corpus-Specific Associations for Multi-Hop Retrieval

arXiv cs.CL / 4/24/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • The paper proposes Association-Augmented Retrieval (AAR), which reranks dense retrieval candidates using learned, corpus-specific associative relationships rather than relying solely on embedding similarity.
  • AAR uses a small 4.2M-parameter MLP trained with contrastive learning on co-occurrence annotations to score bidirectional associations between passages during inference.
  • On HotpotQA, AAR raises passage Recall@5 from 0.831 to 0.916 (+8.6 points) without evaluation-set tuning, with the largest gains on hard questions (+28.5 points); it also improves MuSiQue by +10.1 points in the transductive setting.
  • Experiments indicate the approach is not broadly transferable: an inductive variant trained on training-split associations shows no significant improvement on unseen validation associations, and ablations confirm that using true association pairs (not just semantic similarity) is critical.
  • The method is lightweight and practical, adding about 3.7ms per query, training in under two minutes on a single GPU, and requiring no LLM-based indexing, while retrieval improvements translate to +6.4 exact match in downstream QA.
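The training recipe described above (a small MLP scoring passage pairs, trained contrastively on co-occurrence annotations) can be sketched as follows. The architecture, hidden size, and in-batch InfoNCE loss are illustrative assumptions; the paper only specifies a 4.2M-parameter MLP with contrastive learning, not the exact layout.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AssociationMLP(nn.Module):
    """Hypothetical association head over a pair of passage embeddings.
    Scores are symmetrized by averaging both orderings, matching the
    paper's bidirectional association scoring at a high level."""

    def __init__(self, dim=768, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, a, b):
        s_ab = self.net(torch.cat([a, b], dim=-1)).squeeze(-1)
        s_ba = self.net(torch.cat([b, a], dim=-1)).squeeze(-1)
        return 0.5 * (s_ab + s_ba)

def contrastive_step(model, opt, anchors, positives):
    """One InfoNCE-style step: co-occurring passage pairs are positives,
    the other passages in the batch act as in-batch negatives."""
    n = anchors.size(0)
    # Score every anchor against every candidate in the batch (n x n grid).
    a = anchors.unsqueeze(1).expand(n, n, -1).reshape(n * n, -1)
    p = positives.unsqueeze(0).expand(n, n, -1).reshape(n * n, -1)
    scores = model(a, p).view(n, n)
    # The diagonal holds the true co-occurrence pairs.
    loss = F.cross_entropy(scores, torch.arange(n))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

With in-batch negatives, no explicit negative mining is needed, which is consistent with the reported sub-two-minute training time on a single GPU.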

Abstract

Dense retrieval systems rank passages by embedding similarity to a query, but multi-hop questions require passages that are associatively related through shared reasoning chains. We introduce Association-Augmented Retrieval (AAR), a lightweight transductive reranking method that trains a small MLP (4.2M parameters) to learn associative relationships between passages in embedding space using contrastive learning on co-occurrence annotations. At inference time, AAR reranks an initial dense retrieval candidate set using bi-directional association scoring. On HotpotQA, AAR improves passage Recall@5 from 0.831 to 0.916 (+8.6 points) without evaluation-set tuning, with gains concentrated on hard questions where the dense baseline fails (+28.5 points). On MuSiQue, AAR achieves +10.1 points in the transductive setting. An inductive model trained on training-split associations and evaluated on unseen validation associations shows no significant improvement, suggesting that the method captures corpus-specific co-occurrences rather than transferable patterns. Ablation studies support this interpretation: training on semantically similar but non-associated passage pairs degrades retrieval below the baseline, while shuffling association pairs causes severe degradation. A downstream QA evaluation shows retrieval gains translate to +6.4 exact match improvement. The method adds 3.7ms per query, trains in under two minutes on a single GPU, and requires no LLM-based indexing.
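The inference step in the abstract (rerank an initial dense candidate set using pairwise association scores) could look like the sketch below. The combination rule, the `alpha` weight, and max-pooling over partners are assumptions for illustration; the abstract does not specify how similarity and association scores are combined.

```python
import numpy as np

def rerank_with_associations(query_sim, assoc, alpha=0.5):
    """Rerank dense-retrieval candidates by augmenting each candidate's
    query similarity with its strongest learned association to any other
    candidate in the set.

    query_sim : (n,) dense query-passage similarities for the candidates
    assoc     : (n, n) symmetric association scores from the learned head
    alpha     : illustrative mixing weight (not from the paper)
    Returns candidate indices, best first.
    """
    assoc = assoc.copy()
    np.fill_diagonal(assoc, -np.inf)   # ignore self-association
    partner = assoc.max(axis=1)        # best associated partner per candidate
    final = (1 - alpha) * np.asarray(query_sim) + alpha * partner
    return np.argsort(-final)
```

For a two-hop question, a bridge passage with middling query similarity but a strong association to the top-ranked passage is pulled upward, which matches the reported gains on hard questions where embedding similarity alone fails.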