Embedding-based In-Context Prompt Training for Enhancing LLMs as Text Encoders

arXiv cs.CL / 5/5/2026

📰 News · Models & Research

Key Points

  • The paper introduces EPIC, an embedding-based in-context prompt training method that replaces discrete text demonstrations with continuous embeddings, cutting the token overhead of ICL (see the sketch after this list).
  • EPIC leverages contrastive learning to align semantically related text pairs while training the model to treat demonstration embeddings as part of the in-context prompt.
  • Models trained with EPIC maintain strong embedding performance both when in-context prompts are provided and when they are omitted at inference.
  • Experiments on the MTEB benchmark show EPIC achieves new state-of-the-art results, outperforming frontier models trained only on publicly available retrieval data.
  • The authors’ ablation studies show that the proposed mechanism is both effective and necessary for the observed gains.
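To make the core idea concrete, here is a minimal sketch of how discrete demonstrations could be swapped for continuous vectors at the input layer. It relies on the standard Hugging Face `inputs_embeds` argument; the helper name `build_prompt_embeds`, the GPT-2 stand-in model, and the random demonstration vectors are illustrative assumptions, not details from the paper.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative sketch only: the model choice (gpt2 as a small stand-in for an
# LLM), the helper name, and the random demo_vecs are assumptions, not the
# paper's published interface.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

def build_prompt_embeds(query: str, demo_vecs: torch.Tensor) -> torch.Tensor:
    """Prepend demonstration embeddings (n_demos, hidden) to the query's
    token embeddings, so each demonstration occupies one continuous slot
    instead of a full tokenized text sequence."""
    ids = tok(query, return_tensors="pt").input_ids        # (1, seq_len)
    tok_embeds = model.get_input_embeddings()(ids)         # (1, seq_len, hidden)
    return torch.cat([demo_vecs.unsqueeze(0), tok_embeds], dim=1)

# Hugging Face models accept inputs_embeds in place of input_ids, which is
# what makes continuous demonstrations possible at all.
demo_vecs = torch.randn(2, model.config.hidden_size)  # stand-in embeddings
with torch.no_grad():
    out = model(inputs_embeds=build_prompt_embeds("how do rainbows form?", demo_vecs))
```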

Abstract

Large language models (LLMs) have been widely explored for embedding generation. While recent studies show that in-context learning (ICL) effectively enhances the representational capability of LLMs by prepending a few task-related demonstrations, it causes substantial token overhead due to the increased sequence length. In this work, we propose EPIC, a novel embedding-based in-context prompt training strategy that leverages ICL to generate high-quality embeddings while reducing the computational burden during both training and inference. This approach replaces discrete text demonstrations with their corresponding continuous embeddings, which not only encourages the LLM to align semantically related text pairs during contrastive learning, but also requires the model to interpret demonstration embeddings as part of the in-context prompt. Consequently, EPIC-trained models achieve excellent embedding performance both with and without in-context prompts at inference time. Comprehensive experiments demonstrate that our method establishes new state-of-the-art results on the MTEB benchmark, surpassing frontier models trained solely on publicly available retrieval data. Extensive ablation studies further validate the effectiveness and necessity of our mechanism.
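As a rough illustration of the contrastive-learning component the abstract describes, the sketch below implements a standard InfoNCE loss with in-batch negatives, a common recipe for aligning semantically related text pairs; the paper's exact objective and temperature are not given here, so treat this as an assumption about the general technique rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

# InfoNCE-style contrastive loss with in-batch negatives. The temperature
# value is a conventional default, not taken from the paper.
def info_nce(q: torch.Tensor, p: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """q, p: (batch, dim) query/positive embeddings; row i of p is the
    positive for row i of q, and all other rows act as negatives."""
    q = F.normalize(q, dim=-1)
    p = F.normalize(p, dim=-1)
    logits = q @ p.T / temperature                 # (batch, batch) cosine sims
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

# Dummy tensors stand in for embeddings produced by the trained encoder.
q, p = torch.randn(8, 768), torch.randn(8, 768)
loss = info_nce(q, p)
```

With in-batch negatives, each example's loss pushes its embedding toward its own positive and away from every other positive in the batch, which is what drives semantically related pairs together.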