ADE: Adaptive Dictionary Embeddings -- Scaling Multi-Anchor Representations to Large Language Models

arXiv cs.CL · April 29, 2026


Key Points

  • The paper introduces Adaptive Dictionary Embeddings (ADE), a framework that scales multi-anchor word representations—previously too inefficient for large models—into large language model architectures.
  • ADE’s core components are Vocabulary Projection (VP), which replaces the expensive two-stage anchor lookup with a single matrix operation; Grouped Positional Encoding (GPE), which shares position information among the anchors of a word; and context-aware anchor reweighting driven by self-attention (see the sketches after this list and after the abstract).
  • ADE is integrated into a Segment-Aware Transformer (SAT) to perform context-aware anchor weighting during inference.
  • On AG News and DBpedia-14, ADE shows strong parameter efficiency, using 98.7% fewer trainable parameters than DeBERTa-v3-base; it surpasses DeBERTa on DBpedia-14 and approaches it on AG News, while compressing the embedding layer by over 40×.
  • Overall, the results suggest multi-anchor representations can be a practical, parameter-efficient alternative to single-vector word embeddings in modern transformers.
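The components above are described only at a high level, so the following PyTorch sketch shows one plausible reading of Vocabulary Projection: folding the two-stage anchor lookup (word → anchor ids → weighted anchor vectors) into a single vocabulary-to-anchor matrix product. All identifiers (`anchor_table`, `word2anchor`, `mix_weights`) and shapes are illustrative assumptions, not the authors' code.

```python
# Hedged sketch of Vocabulary Projection (VP): one reading of the idea,
# not the authors' implementation. All names and shapes are illustrative.
import torch

V, A, K, d = 1000, 64, 4, 32                 # vocab size, #anchors, anchors/word, dim

anchor_table = torch.randn(A, d)             # shared dictionary of anchor vectors
word2anchor  = torch.randint(0, A, (V, K))   # each word's K anchor ids
mix_weights  = torch.softmax(torch.randn(V, K), dim=-1)  # per-word anchor weights

tokens = torch.randint(0, V, (2, 8))         # a batch of token ids (B=2, T=8)

# --- Two-stage lookup (the costly baseline) ---
ids  = word2anchor[tokens]                   # (B, T, K): gather anchor ids
vecs = anchor_table[ids]                     # (B, T, K, d): gather anchor vectors
slow = (mix_weights[tokens].unsqueeze(-1) * vecs).sum(dim=2)   # (B, T, d)

# --- Vocabulary Projection: fold both stages into one matrix product ---
# P[v, a] = total weight of anchor a for word v (sparse in practice, dense here).
P = torch.zeros(V, A).scatter_add_(1, word2anchor, mix_weights)
E = P @ anchor_table                         # (V, d): one matrix operation
fast = E[tokens]                             # per-token cost: a plain embedding lookup

assert torch.allclose(slow, fast, atol=1e-5)
```

In practice P would be stored sparsely, and E can be cached between steps or recomputed as the anchor table trains; either way, the per-token cost collapses to an ordinary embedding lookup.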

Abstract

Word embeddings are fundamental to natural language processing, yet traditional approaches represent each word with a single vector, creating representational bottlenecks for polysemous words and limiting semantic expressiveness. While multi-anchor representations have shown promise by representing words as combinations of multiple vectors, they have been limited to small-scale models due to computational inefficiency and lack of integration with modern transformer architectures. We introduce Adaptive Dictionary Embeddings (ADE), a framework that successfully scales multi-anchor word representations to large language models. ADE makes three key contributions: (1) Vocabulary Projection (VP), which transforms the costly two-stage anchor lookup into a single efficient matrix operation; (2) Grouped Positional Encoding (GPE), a novel positional encoding scheme in which anchors of the same word share positional information, preserving semantic coherence while enabling anchor-level variation; and (3) context-aware anchor reweighting, which leverages self-attention to dynamically compose anchor contributions based on sequence context. We integrate these components into the Segment-Aware Transformer (SAT), which provides context-aware reweighting of anchor contributions at inference time. We evaluate ADE on the AG News and DBpedia-14 text classification benchmarks. With 98.7% fewer trainable parameters than DeBERTa-v3-base, ADE surpasses DeBERTa on DBpedia-14 (98.06% vs. 97.80%) and approaches it on AG News (90.64% vs. 94.50%), while compressing the embedding layer by over 40×, demonstrating that multi-anchor representations are a practical and parameter-efficient alternative to single-vector embeddings in modern transformer architectures.
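To make GPE and the reweighting step concrete, here is a hedged PyTorch sketch under stated assumptions: each word expands into K anchor tokens, all K share the word's position index (the GPE idea), and a single attention-style gate stands in for the full self-attention reweighting performed inside SAT. The positional-encoding variant and every identifier here are our choices, not details from the paper.

```python
# Hedged sketch of Grouped Positional Encoding (GPE) plus context-aware anchor
# reweighting. Assumptions, not the authors' code: a concatenated sin/cos
# positional encoding and a mean-pooled context vector as the attention query.
import math
import torch
import torch.nn.functional as F

B, T, K, d = 2, 8, 4, 32          # batch, words per sequence, anchors per word, dim

def sinusoidal_pe(positions, d):
    """Concatenated sin/cos encoding for an integer position tensor."""
    inv_freq = torch.exp(-math.log(10000.0) * torch.arange(0, d, 2) / d)
    angles = positions.unsqueeze(-1).float() * inv_freq       # (..., d/2)
    return torch.cat([angles.sin(), angles.cos()], dim=-1)    # (..., d)

anchors = torch.randn(B, T, K, d)  # per-word anchor vectors (e.g., from VP)

# GPE: every anchor of word t gets position t, so the K anchors of one word
# share positional information instead of consuming K distinct positions.
word_pos = torch.arange(T).view(1, T, 1).expand(B, T, K)      # (B, T, K)
x = anchors + sinusoidal_pe(word_pos, d)                      # anchor-level tokens

# Context-aware reweighting: score each anchor against a pooled sequence
# context and softmax over the K anchors of each word.
context = x.mean(dim=(1, 2), keepdim=True)                    # (B, 1, 1, d)
scores  = (x * context).sum(-1) / math.sqrt(d)                # (B, T, K)
alpha   = F.softmax(scores, dim=-1)                           # anchor weights per word
word_repr = (alpha.unsqueeze(-1) * x).sum(dim=2)              # (B, T, d)
print(word_repr.shape)                                        # torch.Size([2, 8, 32])
```

The gate here is deliberately minimal; in ADE the reweighting comes from SAT's self-attention over the full sequence, so the weights of a word's anchors can differ per occurrence.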