LLM-Guided Semantic Bootstrapping for Interpretable Text Classification with Tsetlin Machines

arXiv cs.CL · April 15, 2026


Key Points

  • The paper addresses the tradeoff between semantic generalization and transparency in text classification by combining the semantic strength of pretrained language models (PLMs) with the interpretability of Tsetlin Machines (TMs).
  • It introduces an LLM-guided semantic bootstrapping pipeline where, for each class label, an LLM generates sub-intents that drive synthetic data creation via a three-stage curriculum (seed, core, enriched).
  • A Non-Negated Tsetlin Machine (NTM) is trained to extract high-confidence, interpretable literals that serve as semantic cues derived from the LLM.
  • By injecting these learned cues into real data, the TM can better align clause-level logic with LLM-inferred semantics without needing embeddings or runtime LLM calls.
  • Experiments across multiple text classification tasks show improved interpretability and accuracy over vanilla TMs, reaching performance comparable to BERT while remaining fully symbolic and efficient.
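
The generation stage of the pipeline above can be sketched as plain prompt construction: for each class, the LLM-proposed sub-intents are crossed with the three curriculum stages to yield one generation request per pair. This is a minimal illustration, not the paper's exact prompts; `build_prompts`, the stage hints, and the example label are all hypothetical, and the actual LLM call is left abstract.

```python
# Hypothetical sketch of the sub-intent -> three-stage curriculum step.
# The actual chat-completion call is omitted; only prompt assembly is shown.

CURRICULUM = ("seed", "core", "enriched")

# Illustrative stage descriptions (not taken from the paper).
STAGE_HINTS = {
    "seed":     "short, prototypical phrases",
    "core":     "full natural sentences covering the sub-intent",
    "enriched": "diverse paraphrases with varied vocabulary",
}

def build_prompts(label: str, sub_intents: list[str]) -> list[dict]:
    """Build one synthetic-data generation prompt per (sub-intent, stage)."""
    prompts = []
    for intent in sub_intents:
        for stage in CURRICULUM:
            prompts.append({
                "label": label,
                "sub_intent": intent,
                "stage": stage,
                "prompt": (
                    f"Write 10 examples of the class '{label}', "
                    f"sub-intent '{intent}', as {STAGE_HINTS[stage]}."
                ),
            })
    return prompts

# Example: 2 sub-intents x 3 stages = 6 generation requests.
prompts = build_prompts("complaint", ["billing error", "late delivery"])
```

Keeping the stages as separate requests makes it easy to control how much of each curriculum tier enters the synthetic training set.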

Abstract

Pretrained language models (PLMs) like BERT provide strong semantic representations but are costly and opaque, while symbolic models such as the Tsetlin Machine (TM) offer transparency but lack semantic generalization. We propose a semantic bootstrapping framework that transfers LLM knowledge into symbolic form, combining interpretability with semantic capacity. Given a class label, an LLM generates sub-intents that guide synthetic data creation through a three-stage curriculum (seed, core, enriched), expanding semantic diversity. A Non-Negated TM (NTM) learns from these examples to extract high-confidence literals as interpretable semantic cues. Injecting these cues into real data enables a TM to align clause logic with LLM-inferred semantics. Our method requires no embeddings or runtime LLM calls, yet equips symbolic models with pretrained semantic priors. Across multiple text classification tasks, it improves interpretability and accuracy over vanilla TM, achieving performance comparable to BERT while remaining fully symbolic and efficient.
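
The cue-extraction and injection steps described in the abstract can be sketched as two small functions: keep literals from clauses whose confidence exceeds a threshold, then append matching cue tokens to real documents so downstream TM clauses can condition on them. The clause representation (a set of literals plus a scalar weight), the threshold, and the `CUE_` marker are illustrative assumptions, not the paper's exact format.

```python
# Hedged sketch of cue extraction from an NTM and cue injection into
# real data. Clause/weight representation below is hypothetical.

def extract_cues(clauses: list[tuple[set[str], float]],
                 min_weight: float = 0.8) -> set[str]:
    """Collect literals from high-confidence clauses as semantic cues."""
    cues: set[str] = set()
    for literals, weight in clauses:
        if weight >= min_weight:  # keep only confident clauses
            cues.update(literals)
    return cues

def inject_cues(tokens: list[str], cues: set[str]) -> list[str]:
    """Append marker features for any cue token present in the document,
    giving the downstream TM explicit LLM-derived literals to match on."""
    return tokens + [f"CUE_{t}" for t in tokens if t in cues]

# Example: two clauses, one above the confidence threshold.
clauses = [({"refund", "angry"}, 0.92), ({"hello"}, 0.31)]
cues = extract_cues(clauses)
augmented = inject_cues(["i", "want", "a", "refund"], cues)
```

Because the cues are plain tokens rather than embeddings, the augmented input stays fully symbolic and the final TM's clauses remain human-readable.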