AI Navigate

Learning Retrieval Models with Sparse Autoencoders

arXiv cs.LG / March 17, 2026

📰 News · Models & Research

Key Points

  • The paper introduces SPLARE, a method to train SAE-based learned sparse retrieval (LSR) models that encode queries and documents into high-dimensional sparse representations rather than projecting into the vocabulary space.
  • Sparse autoencoders are used to decompose dense LLM representations into interpretable latent features, enabling more semantically structured and language-agnostic retrieval signals.
  • Empirical results show SPLARE-based LSR consistently outperforms vocabulary-based LSR in multilingual and out-of-domain settings, with SPLARE-7B achieving top results on MMTEB multilingual and English retrieval tasks.
  • A lighter 2B-parameter variant retains strong retrieval performance with a significantly smaller footprint, demonstrating the practical scalability of the approach.

Abstract

Sparse autoencoders (SAEs) provide a powerful mechanism for decomposing the dense representations produced by Large Language Models (LLMs) into interpretable latent features. We posit that SAEs constitute a natural foundation for Learned Sparse Retrieval (LSR), whose objective is to encode queries and documents into high-dimensional sparse representations optimized for efficient retrieval. In contrast to existing LSR approaches that project input sequences into the vocabulary space, SAE-based representations offer the potential to produce more semantically structured, expressive, and language-agnostic features. Building on this insight, we introduce SPLARE, a method to train SAE-based LSR models. Our experiments, relying on recently released open-source SAEs, demonstrate that this technique consistently outperforms vocabulary-based LSR in multilingual and out-of-domain settings. SPLARE-7B, a multilingual retrieval model capable of producing generalizable sparse latent embeddings for a wide range of languages and domains, achieves top results on MMTEB's multilingual and English retrieval tasks. We also developed a 2B-parameter variant with a significantly lighter footprint.
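The core idea, encoding queries and documents as sparse vectors of SAE latent activations and scoring them by overlap, can be sketched in a few lines. This is a minimal toy illustration, not SPLARE's actual architecture or training procedure: the random encoder weights, dimensions, and top-k sparsification below are all hypothetical stand-ins for a trained SAE applied to real LLM embeddings.

```python
# Toy sketch of SAE-based learned sparse retrieval (hypothetical values
# throughout; SPLARE's real components are described in the paper).
import numpy as np

rng = np.random.default_rng(0)

D_DENSE = 64    # dense LLM embedding size (toy value)
D_LATENT = 512  # SAE latent dimension, typically much larger than D_DENSE
TOP_K = 16      # keep only the strongest latents -> sparse representation

# Stand-ins for trained SAE encoder parameters.
W_enc = rng.standard_normal((D_LATENT, D_DENSE)) / np.sqrt(D_DENSE)
b_enc = np.zeros(D_LATENT)

def sae_encode(dense_vec: np.ndarray) -> dict[int, float]:
    """Map a dense embedding to a sparse {latent_id: activation} dict."""
    acts = np.maximum(W_enc @ dense_vec + b_enc, 0.0)  # ReLU latents
    top = np.argsort(acts)[-TOP_K:]                    # top-k sparsification
    return {int(i): float(acts[i]) for i in top if acts[i] > 0.0}

def score(q: dict[int, float], d: dict[int, float]) -> float:
    """Sparse dot product over the latents active in both vectors."""
    shared = q.keys() & d.keys()
    return sum(q[i] * d[i] for i in shared)

# Toy corpus: in practice these dense vectors would come from an LLM.
docs = [rng.standard_normal(D_DENSE) for _ in range(5)]
doc_reps = [sae_encode(d) for d in docs]

# A query near doc 2 should share most active latents with doc 2.
query = docs[2] + 0.05 * rng.standard_normal(D_DENSE)
q_rep = sae_encode(query)
best = max(range(len(docs)), key=lambda i: score(q_rep, doc_reps[i]))
print(best)
```

Because each representation activates only a handful of latents, retrieval can be served from an inverted index keyed by latent ID, exactly as vocabulary-based LSR systems index terms, which is what makes SAE latents a drop-in sparse signal.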