Brain-CLIPLM: Decoding Compressed Semantic Representations in EEG for Language Reconstruction

arXiv cs.CL / 4/21/2026


Key Points

  • The paper argues that decoding full sentence-level language structure from non-invasive EEG is fundamentally constrained by low signal-to-noise ratio and limited information bandwidth.
  • It proposes a “semantic compression” hypothesis: EEG primarily encodes a compressed set of semantic anchors rather than complete linguistic structure.
  • To match the intrinsic information capacity of EEG, Brain-CLIPLM uses a two-stage approach: contrastive learning for semantic anchor extraction and a retrieval-grounded LLM with Chain-of-Thought reasoning for sentence reconstruction.
  • On the Zurich Cognitive Language Processing Corpus, the method reports 67.55% top-5 and 85.00% top-25 sentence retrieval accuracy, outperforming a direct decoding baseline.
  • Cross-subject testing and control analyses (including permutation tests) indicate that EEG representations contain sentence-specific information beyond language-model priors, supporting a data-efficient pathway for non-invasive brain-computer interfaces.
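The first stage described above is a CLIP-style contrastive alignment between EEG embeddings and sentence embeddings. The paper does not publish its exact loss code, so the following is a minimal NumPy sketch of the standard symmetric InfoNCE objective such a stage would use; the encoders are abstracted away as precomputed embedding matrices, and the `temperature` value is an illustrative default, not the paper's setting.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Unit-normalize rows so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def clip_style_loss(eeg_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    eeg_emb, text_emb: (batch, dim) arrays; row i of each is a matched pair.
    """
    eeg = l2_normalize(eeg_emb)
    txt = l2_normalize(text_emb)
    logits = eeg @ txt.T / temperature           # (batch, batch) similarity matrix
    labels = np.arange(len(logits))              # diagonal entries are the true pairs

    def cross_entropy(lg, lb):
        lg = lg - lg.max(axis=1, keepdims=True)  # subtract row max for stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(lb)), lb].mean()

    # Average the EEG->text and text->EEG directions, as in CLIP.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))

# Toy check: embeddings that are noisy copies of each other should score
# a lower loss than mismatched pairs.
rng = np.random.default_rng(0)
eeg = rng.normal(size=(8, 64))
txt = eeg + 0.1 * rng.normal(size=(8, 64))
print(clip_style_loss(eeg, txt))
```

Minimizing this loss pulls each EEG trial toward the embedding of its own sentence and away from the other sentences in the batch, which is what lets the second stage treat decoding as retrieval over a candidate pool rather than free-form generation.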

Abstract

Decoding natural language from non-invasive electroencephalography (EEG) remains fundamentally limited by low signal-to-noise ratio and restricted information bandwidth. This raises a central question: can sentence-level linguistic structure be reliably recovered from such signals at all? In this work, we suggest that this assumption may not hold under realistic information constraints, and instead propose a semantic compression hypothesis in which EEG signals encode a compressed set of semantic anchors rather than full linguistic structure. Under this perspective, direct sentence reconstruction becomes an overparameterized objective relative to the intrinsic information capacity of EEG. To address this mismatch, we introduce Brain-CLIPLM, a two-stage framework that decomposes EEG-to-text decoding into semantic anchor extraction via contrastive learning and sentence reconstruction using a retrieval-grounded large language model (LLM) with Chain-of-Thought (CoT) reasoning, following a granularity matching principle that aligns decoding complexity with neural information capacity. Evaluated on the Zurich Cognitive Language Processing Corpus, Brain-CLIPLM achieves 67.55% top-5 and 85.00% top-25 sentence retrieval accuracy, significantly outperforming a direct decoding baseline, while cross-subject evaluation confirms robust generalization. Control analyses, including permutation testing, further demonstrate that EEG-derived representations carry sentence-specific information beyond language-model priors. These results suggest that EEG-to-text decoding is better framed as recovering compressed semantic content rather than reconstructing full sentences, providing a biologically grounded and data-efficient pathway for non-invasive brain-computer interfaces.
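The top-5 and top-25 numbers quoted above come from a sentence-retrieval evaluation: a decoded trial counts as correct if its true sentence appears among the k nearest candidates in embedding space. The paper's evaluation harness is not shown, so this is a hypothetical sketch of that metric using cosine similarity over a candidate sentence bank; all names and the toy data are illustrative.

```python
import numpy as np

def topk_retrieval_accuracy(eeg_emb, sent_bank, true_ids, k=5):
    """Fraction of EEG trials whose true sentence ranks in the top-k
    candidates by cosine similarity (the top-5 / top-25 style metric).

    eeg_emb:   (trials, dim) decoded EEG embeddings
    sent_bank: (candidates, dim) embeddings of all candidate sentences
    true_ids:  (trials,) index into sent_bank of each trial's true sentence
    """
    eeg = eeg_emb / np.linalg.norm(eeg_emb, axis=1, keepdims=True)
    bank = sent_bank / np.linalg.norm(sent_bank, axis=1, keepdims=True)
    sims = eeg @ bank.T                           # (trials, candidates)
    topk = np.argsort(-sims, axis=1)[:, :k]       # indices of the k best candidates
    hits = (topk == np.asarray(true_ids)[:, None]).any(axis=1)
    return hits.mean()

# Toy data: decoded embeddings are noisy copies of the true sentence embeddings.
rng = np.random.default_rng(1)
bank = rng.normal(size=(50, 32))                  # 50 candidate sentences
ids = np.arange(50)
eeg = bank + 0.5 * rng.normal(size=(50, 32))
print(topk_retrieval_accuracy(eeg, bank, ids, k=5))
```

By construction top-25 accuracy can never be below top-5 accuracy, which is why the paper's 85.00% top-25 figure sits above its 67.55% top-5 figure; the gap between the two indicates how often the decoder lands near, but not on, the correct sentence.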