Internal Knowledge Without External Expression: Probing the Generalization Boundary of a Classical Chinese Language Model

arXiv cs.CL / April 17, 2026


Key Points

  • The paper trains a 318M-parameter Transformer from scratch on 1.56B tokens of pure Classical Chinese (no English characters or Arabic numerals) and evaluates it with systematic out-of-distribution (OOD) tests contrasting known and unknown historical events.
  • Results show a clear dissociation between internal and external uncertainty: the model’s perplexity rises sharply on fabricated and semi-fabricated events, indicating genuine factual encoding, yet the model fails to reliably express that uncertainty in its generated text.
  • Across three languages (Classical Chinese, English, Japanese), three writing systems, and eight model sizes (110M–1.56B parameters), the ability to *express* epistemic uncertainty is driven by training-data rhetorical conventions rather than genuine metacognition.
  • The authors identify a “humility paradox” in Classical Chinese models (more hedging on known topics than unknown ones) and contrast it with Japanese models, which almost never hedge, arguing that metacognitive “I don’t know” behavior requires explicit training signals such as RLHF.
  • The study concludes that language-model generalization can contain meaningful internal uncertainty while remaining outwardly uncalibrated, highlighting limits of classical language modeling for reliable uncertainty communication.
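The internal-uncertainty measurement summarized above can be made concrete with a small sketch. This is not the authors' code: the per-token log-probability values are invented for illustration, and the `perplexity` helper is a standard definition (exponentiated negative mean log-likelihood), used here only to show how a perplexity jump ratio between real and fabricated event prompts would be computed.

```python
import math

def perplexity(token_logprobs):
    """Perplexity of one sequence from per-token log-probabilities (natural log)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token log-probs a model might assign to prompts about
# real vs. fabricated historical events (illustrative numbers only).
real_event_logprobs = [[-1.2, -0.8, -1.0], [-0.9, -1.1, -1.3]]
fabricated_logprobs = [[-2.4, -2.0, -2.6], [-2.2, -2.5, -1.9]]

ppl_real = [perplexity(lp) for lp in real_event_logprobs]
ppl_fab = [perplexity(lp) for lp in fabricated_logprobs]

# The "jump ratio": mean perplexity on fabricated events over mean on real events.
jump_ratio = (sum(ppl_fab) / len(ppl_fab)) / (sum(ppl_real) / len(ppl_real))
print(f"mean PPL (real):       {sum(ppl_real) / len(ppl_real):.2f}")
print(f"mean PPL (fabricated): {sum(ppl_fab) / len(ppl_fab):.2f}")
print(f"jump ratio:            {jump_ratio:.2f}")
```

With the toy numbers the ratio comes out well above 1, mirroring the direction (though not the magnitude) of the paper's reported 2.39x split; a real replication would also need the significance test the authors run across the two groups.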

Abstract

We train a 318M-parameter Transformer language model from scratch on a curated corpus of 1.56 billion tokens of pure Classical Chinese, with zero English characters or Arabic numerals. Through systematic out-of-distribution (OOD) testing, we investigate whether the model can distinguish known from unknown inputs, and crucially, whether it can express this distinction in its generated text. We find a clear dissociation between internal and external uncertainty. Internally, the model exhibits a perplexity jump ratio of 2.39x between real and fabricated historical events (p = 8.9e-11, n = 92 per group), with semi-fabricated events (real figures + fictional events) showing the highest perplexity (4.24x, p = 1.1e-16), demonstrating genuine factual encoding beyond syntactic pattern matching. Externally, however, the model never learns to express uncertainty: Classical Chinese epistemic markers appear at lower rates for OOD questions (3.5%) than for in-distribution questions (8.3%, p = 0.023), reflecting rhetorical conventions rather than genuine metacognition. We replicate both findings across three languages (Classical Chinese, English, Japanese), three writing systems, and eight models from 110M to 1.56B parameters. We further show that uncertainty expression frequency is determined entirely by training data conventions, with Classical Chinese models showing a "humility paradox" (more hedging for known topics), while Japanese models almost never hedge. We argue that metacognitive expression -- the ability to say "I don't know" -- does not emerge from language modeling alone and requires explicit training signals such as RLHF.
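The external-uncertainty measurement in the abstract amounts to counting how often generated answers contain a hedging expression. A minimal sketch of that rate computation, assuming a hypothetical marker list and toy generations (the paper's actual marker inventory and prompts are not reproduced here):

```python
# Hypothetical epistemic markers (Classical Chinese hedges plus English ones);
# the paper's actual marker list is not reproduced here.
EPISTEMIC_MARKERS = ["未詳", "不知", "蓋", "疑", "I don't know", "perhaps"]

def hedging_rate(generations, markers=EPISTEMIC_MARKERS):
    """Fraction of generated answers containing at least one epistemic marker."""
    hits = sum(1 for text in generations if any(m in text for m in markers))
    return hits / len(generations)

# Toy in-distribution vs. OOD generations (illustrative strings only).
in_dist_answers = ["漢高祖起於沛", "蓋聞天命靡常", "孔子生於魯"]
ood_answers = ["其人未詳", "此事不知所出", "某王即位三年"]

print(f"ID hedging rate:  {hedging_rate(in_dist_answers):.3f}")
print(f"OOD hedging rate: {hedging_rate(ood_answers):.3f}")
```

Substring matching like this is a simplification: the paper's "humility paradox" finding (hedges appearing at 8.3% on in-distribution vs. 3.5% on OOD questions) rests on these rates reflecting rhetorical convention, so a careful replication would need to distinguish genuinely epistemic uses of a marker like 蓋 from its formulaic discourse uses.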