Empirical Sufficiency Lower Bounds for Language Modeling with Locally-Bootstrapped Semantic Structures
arXiv cs.AI / 4/6/2026
Key Points
- The paper revisits prior negative results for language modeling with predicted semantic structure and derives empirical lower bounds on the incremental tagging quality that would be required for semantic bootstrapping to outperform a baseline.
- It proposes a compact binary (lexical-level) vector representation of semantic structure and evaluates in depth how much incremental tagging accuracy an end-to-end semantic-bootstrapping language model requires.
- The authors frame the target system as a hybrid of a pretrained sequential neural component and a hierarchical-symbolic component, aiming to generate text with lower surprisal and higher interpretability.
- They find that the dimensionality of the semantic vector representation can be significantly reduced while preserving key benefits, improving practicality of the semantic-structure encoding.
- A key methodological takeaway is that quality lower bounds cannot be inferred from a single overall score; they must explicitly consider the distributions of both the useful signal and the noise.
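The last point can be made concrete with a small simulation. The sketch below is purely illustrative and is not taken from the paper: it assumes a correct semantic tag reduces per-token surprisal by an amount drawn from a "signal" distribution, while a wrong tag increases it by an amount drawn from a "noise" distribution. The break-even tagging accuracy then depends on both distributions, not on any single aggregate score; all distribution parameters here are invented for the example.

```python
import random

random.seed(0)

# Hypothetical per-token effect of a predicted semantic tag on surprisal
# (in bits). These distributions and their parameters are assumptions
# made for illustration, not values reported in the paper.
def sample_signal():
    # Surprisal reduction from a correct tag.
    return random.gauss(0.30, 0.10)

def sample_noise():
    # Surprisal increase from an incorrect tag.
    return random.gauss(0.50, 0.25)

def expected_net_gain(accuracy, n=100_000):
    """Monte Carlo estimate of the expected surprisal change per token
    for a tagger with the given accuracy."""
    total = 0.0
    for _ in range(n):
        if random.random() < accuracy:
            total += sample_signal()
        else:
            total -= sample_noise()
    return total / n

# Scan accuracies to locate the empirical lower bound: the smallest
# accuracy at which bootstrapped tags help on average. With the means
# above, the analytic break-even point is 0.5 / (0.3 + 0.5) = 0.625.
for acc in [a / 100 for a in range(50, 100, 5)]:
    if expected_net_gain(acc) > 0:
        print(f"tagging accuracy >= {acc:.2f} yields a positive net gain")
        break
```

Note that shifting either distribution (e.g. heavier-tailed noise) moves the break-even accuracy even if the tagger's overall score is unchanged, which is exactly why a single aggregate number cannot determine the lower bound.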




