Do BERT Embeddings Encode Narrative Dimensions? A Token-Level Probing Analysis of Time, Space, Causality, and Character in Fiction

arXiv cs.CL · April 14, 2026


Key Points

  • The study tests whether BERT token embeddings encode fictional narrative semantics—time, space, causality, and character—using a token-level probing setup with LLM-assisted annotation.
  • A linear probe on BERT embeddings reaches 94% accuracy and a macro-average recall of 0.83 (with balanced class weighting), outperforming a variance-matched random-embedding baseline (47%).
  • Performance is weaker for rarer narrative dimensions, especially space (recall = 0.66) and causality (recall = 0.75), indicating uneven representation strength across dimensions.
  • The analysis finds “Boundary Leakage,” where rare dimensions are often misclassified as “others,” and unsupervised clustering aligns near-randomly with the predefined categories (ARI = 0.081), implying the dimensions are encoded but not as discretely separable clusters.
  • The authors propose future work such as POS-only baselines, expanded datasets, and layer-wise probing to separate syntactic effects from narrative encoding.
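The probing setup described above can be sketched as follows. This is an illustrative toy reconstruction, not the authors' code: the synthetic "embeddings," dimensions, and class structure are assumptions standing in for BERT token embeddings and the five narrative labels. It shows the core comparison, a linear probe with balanced class weighting trained on structured embeddings versus a variance-matched random-embedding control.

```python
# Toy sketch of the probe-vs-control comparison (synthetic data, not the
# paper's dataset or dimensions).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Stand-in for BERT token embeddings: 5 classes (time, space, causality,
# character, others) with class-dependent means so a linear probe can succeed.
n, d, n_classes = 2000, 32, 5
y = rng.integers(0, n_classes, size=n)
centers = rng.normal(0, 2.0, size=(n_classes, d))
X = centers[y] + rng.normal(0, 1.0, size=(n, d))

# Control: random embeddings matched to X's per-dimension variance,
# but carrying no information about the labels.
X_rand = rng.normal(0, X.std(axis=0), size=(n, d))

def probe_accuracy(features, labels):
    # Balanced class weighting, as in the paper, to handle rare categories.
    clf = LogisticRegression(max_iter=1000, class_weight="balanced")
    clf.fit(features[:1500], labels[:1500])
    return accuracy_score(labels[1500:], clf.predict(features[1500:]))

acc_real = probe_accuracy(X, y)
acc_rand = probe_accuracy(X_rand, y)
print(f"probe on structured embeddings: {acc_real:.2f}")
print(f"probe on variance-matched random control: {acc_rand:.2f}")
```

On this synthetic data the structured probe performs far above the control, which hovers near the 5-class chance rate, the same qualitative pattern as the paper's 94% vs. 47% result.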

Abstract

Narrative understanding requires multidimensional semantic structures. This study investigates whether BERT embeddings encode dimensions of fictional narrative semantics -- time, space, causality, and character. Using an LLM to accelerate annotation, we construct a token-level dataset labeled with these four narrative categories plus "others." A linear probe on BERT embeddings (94% accuracy) significantly outperforms a control probe on variance-matched random embeddings (47%), confirming that BERT encodes meaningful narrative information. With balanced class weighting, the probe achieves a macro-average recall of 0.83, with moderate success on rare categories such as causality (recall = 0.75) and space (recall = 0.66). However, confusion matrix analysis reveals "Boundary Leakage," where rare dimensions are systematically misclassified as "others." Clustering analysis shows that unsupervised clustering aligns near-randomly with predefined categories (ARI = 0.081), suggesting that narrative dimensions are encoded but not as discretely separable clusters. Future work includes a POS-only baseline to disentangle syntactic patterns from narrative encoding, expanded datasets, and layer-wise probing.
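The abstract's central contrast, that a category can be linearly decodable from embeddings while unsupervised clustering fails to recover it, can be demonstrated on synthetic data. The sketch below is an assumption-laden illustration (not the authors' analysis): class signal lives in a few low-variance dimensions, while high-variance nuisance dimensions dominate Euclidean distance, so a linear probe succeeds where KMeans yields a near-zero Adjusted Rand Index.

```python
# Illustrative sketch of "encoded but not discretely separable":
# linear probe succeeds, unsupervised clustering aligns near-randomly.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
n, d, k = 2000, 16, 5
y = rng.integers(0, k, size=n)

X = rng.normal(size=(n, d))
X[np.arange(n), y] += 1.5   # modest class signal in dims 0..4
X[:, k:] *= 5.0             # high-variance nuisance dims 5..15

# Linear probe: learns to weight the signal dims despite nuisance variance.
clf = LogisticRegression(max_iter=1000).fit(X[:1500], y[:1500])
probe_acc = clf.score(X[1500:], y[1500:])

# KMeans: distances are dominated by nuisance variance, so the recovered
# clusters bear little relation to the labels (ARI near 0).
pred = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
ari = adjusted_rand_score(y, pred)
print(f"probe accuracy = {probe_acc:.2f}, ARI = {ari:.3f}")
```

This mirrors the paper's finding of a strong probe alongside ARI = 0.081: supervised decodability does not require the categories to form separable clusters in the embedding geometry.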