Empirical Sufficiency Lower Bounds for Language Modeling with Locally-Bootstrapped Semantic Structures

arXiv cs.AI / 4/6/2026


Key Points

  • The paper revisits prior negative results for language modeling with predicted semantic structure and derives empirical lower bounds on what incremental tagging quality would be required for semantic-bootstrapping to outperform a baseline.
  • It proposes a compact binary (lexical-level) vector representation of semantic structure and evaluates in depth how accurate an incremental tagger must be for an end-to-end semantic-bootstrapping language model to outperform the baseline.
  • The authors frame the target system as a hybrid of a pretrained sequential neural component and a hierarchical-symbolic component to generate text with low surprisal and higher interpretability.
  • They find that the dimensionality of the semantic vector representation can be significantly reduced while preserving key benefits, improving practicality of the semantic-structure encoding.
  • A key methodological takeaway is that quality lower bounds cannot be inferred from a single overall score; they must explicitly consider the distributions of both the useful signal and the noise.
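To make the compact binary encoding concrete, here is a minimal sketch of what a lexical-level binary semantic tag vector could look like as a bitmask. The feature inventory below is hypothetical and chosen only for illustration; it is not the paper's actual tag scheme.

```python
# Hypothetical sketch: a compact binary (lexical-level) semantic tag vector.
# The feature inventory is illustrative, NOT the paper's actual encoding.
FEATURES = ["entity", "predicate", "quantifier", "negation", "modality", "tense"]

def encode_tags(active: set) -> int:
    """Pack the active semantic features of one token into a bitmask."""
    vec = 0
    for i, feat in enumerate(FEATURES):
        if feat in active:
            vec |= 1 << i
    return vec

def decode_tags(vec: int) -> set:
    """Recover the set of active features from the bitmask."""
    return {f for i, f in enumerate(FEATURES) if (vec >> i) & 1}

v = encode_tags({"predicate", "tense"})
assert decode_tags(v) == {"predicate", "tense"}
```

The point of such an encoding is that each token's semantic annotation fits in a handful of bits, which is what makes the paper's dimensionality-reduction finding practically relevant.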

Abstract

In this work we build upon negative results from an attempt at language modeling with predicted semantic structure, in order to establish empirical lower bounds on what could have made the attempt successful. More specifically, we design a concise binary vector representation of semantic structure at the lexical level and evaluate in depth how good an incremental tagger needs to be in order to achieve better-than-baseline performance with an end-to-end semantic-bootstrapping language model. We envision such a system as consisting of a (pretrained) sequential-neural component and a hierarchical-symbolic component working together to generate text with low surprisal and high linguistic interpretability. We find that (a) the dimensionality of the semantic vector representation can be dramatically reduced without losing its main advantages and (b) lower bounds on prediction quality cannot be established via a single score alone, but need to take the distributions of signal and noise into account.
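The claim that a single score cannot establish a quality lower bound can be illustrated with a toy calculation (the numbers and the 4x signal weight are invented for illustration; this is not the paper's experiment): two taggers with identical overall accuracy can deliver very different downstream value depending on where their errors fall.

```python
# Illustrative sketch: same overall accuracy, different downstream usefulness.
# All numbers are hypothetical, not taken from the paper.

# Toy corpus: 800 "easy" tokens and 200 "hard" (high-signal) tokens;
# assume hard tokens carry 4x more signal for the language model.
def usefulness(correct_easy: float, correct_hard: float) -> float:
    return correct_easy * 1 + correct_hard * 4

# Tagger A: uniformly 90% accurate on both token types.
a = usefulness(0.9 * 800, 0.9 * 200)   # -> 1440.0
# Tagger B: perfect on easy tokens, 50% on hard ones.
b = usefulness(1.0 * 800, 0.5 * 200)   # -> 1200.0

# Both taggers score 90% overall accuracy...
acc_a = (0.9 * 800 + 0.9 * 200) / 1000  # 0.9
acc_b = (1.0 * 800 + 0.5 * 200) / 1000  # 0.9
# ...yet A is clearly more useful, because its errors are not
# concentrated on the high-signal tokens.
assert acc_a == acc_b and a > b
```

This is why the authors argue the lower bound must be stated over the joint distributions of signal and noise, not over a single aggregate score.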