Do Language Models Encode Semantic Relations? Probing and Sparse Feature Analysis

arXiv cs.CL / 4/1/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper investigates whether and where large language models encode structured semantic relations—synonymy, antonymy, hypernymy, and hyponymy—across models of increasing scale (Pythia-70M, GPT-2, and Llama 3.1 8B).
  • Using linear probing together with mechanistic interpretability methods (sparse autoencoders and activation patching), the authors map the layer/pathway locations and the specific features that contribute to representing these relations.
  • Results show a directional asymmetry in hierarchical relations: hypernymy is redundantly represented and difficult to suppress, while hyponymy depends on compact features that are more vulnerable to ablation.
  • Relation signals are diffuse yet stable: they typically peak in mid-layers and are stronger in post-residual/MLP pathways than in attention.
  • Probe-level causal effects vary with model capacity: SAE-guided patching produces reliable shifts on Llama 3.1 but weak or unstable effects on the smaller models. Relation difficulty, by contrast, is consistent across all models, with antonymy the easiest to decode and synonymy the hardest.
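The linear-probing step described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the activations here are synthetic stand-ins (with a weak planted signal so the probe has something to find), where the paper would instead extract hidden states for word pairs from a specific model layer.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical setup: one feature vector per word pair, labeled by relation.
# In the paper these would be hidden states from a chosen layer/pathway.
rng = np.random.default_rng(0)
n_pairs, hidden_dim, n_relations = 400, 64, 4  # synonymy/antonymy/hypernymy/hyponymy

X = rng.normal(size=(n_pairs, hidden_dim))      # simulated pair representations
y = rng.integers(0, n_relations, size=n_pairs)  # relation labels

# Plant a weak linear signal: each relation shifts one dimension.
for k in range(n_relations):
    X[y == k, k] += 2.0

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The linear probe: if a simple classifier can decode the relation from
# the activations, the representation linearly encodes that information.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"probe accuracy: {probe.score(X_test, y_test):.2f}")
```

Running the same probe per layer and comparing accuracies is what produces the "peaks in mid-layers" profile the paper reports.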

Abstract

Understanding whether large language models (LLMs) capture structured meaning requires examining how they represent concept relationships. In this work, we study three models of increasing scale: Pythia-70M, GPT-2, and Llama 3.1 8B, focusing on four semantic relations: synonymy, antonymy, hypernymy, and hyponymy. We combine linear probing with mechanistic interpretability techniques, including sparse autoencoders (SAEs) and activation patching, to identify where these relations are encoded and how specific features contribute to their representation. Our results reveal a directional asymmetry in hierarchical relations: hypernymy is encoded redundantly and resists suppression, while hyponymy relies on compact features that are more easily disrupted by ablation. More broadly, relation signals are diffuse but exhibit stable profiles: they peak in the mid-layers and are stronger in post-residual/MLP pathways than in attention. Difficulty is consistent across models (antonymy easiest, synonymy hardest). Probe-level causality is capacity-dependent: on Llama 3.1, SAE-guided patching reliably shifts these signals, whereas on smaller models the shifts are weak or unstable. Our results clarify where and how reliably semantic relations are represented inside LLMs, and provide a reproducible framework for relating sparse features to probe-level causal evidence.
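Activation patching, the second intervention the abstract mentions, has a simple core mechanic: cache an intermediate activation from one forward pass and splice it into another, then measure how the output moves. The toy model below is a hedged stand-in with random weights, not an actual transformer; it only illustrates the splice-and-compare logic.

```python
import numpy as np

# Toy two-layer network standing in for a transformer pathway;
# the weights are random stand-ins, not real model parameters.
rng = np.random.default_rng(1)
W1 = rng.normal(size=(8, 16))
W2 = rng.normal(size=(16, 4))

def forward(x, patched_hidden=None):
    """Run the toy model, optionally overwriting the mid-layer activation."""
    h = np.tanh(x @ W1)      # the intermediate state we can intervene on
    if patched_hidden is not None:
        h = patched_hidden   # activation patching: splice in another run's state
    return h @ W2            # output "logits"

x_clean = rng.normal(size=8)    # e.g., an input exhibiting the relation
x_corrupt = rng.normal(size=8)  # e.g., a control input

# Cache the clean run's hidden state, then patch it into the corrupted run.
h_clean = np.tanh(x_clean @ W1)
out_corrupt = forward(x_corrupt)
out_patched = forward(x_corrupt, patched_hidden=h_clean)

# The size of the shift is the causal effect attributed to that activation.
effect = np.linalg.norm(out_patched - out_corrupt)
print(f"patching effect: {effect:.3f}")
```

In the SAE-guided variant the paper uses, the patch targets individual sparse features decoded from the activation rather than the whole hidden vector, which is what lets the authors attribute probe-level shifts to specific features.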