SCRIPT: A Subcharacter Compositional Representation Injection Module for Korean Pre-Trained Language Models

arXiv cs.CL / 4/15/2026


Key Points

  • The paper introduces SCRIPT, a model-agnostic injection module that adds subcharacter (Jamo) compositional knowledge to Korean pre-trained language models (PLMs), most of which currently rely on subword tokenization.
  • SCRIPT enhances subword embeddings with structural granularity while requiring no architectural changes or additional pre-training, making it broadly applicable to existing PLMs.
  • In the reported experiments, SCRIPT improves every baseline it is applied to across multiple Korean NLU and NLG tasks.
  • Additional linguistic analyses suggest SCRIPT modifies the embedding space to better reflect grammatical regularities and produce semantically cohesive variations.
  • The authors provide the implementation at the linked GitHub repository, supporting adoption and reproducibility.

Abstract

Korean is a morphologically rich language with a featural writing system in which each character is systematically composed of subcharacter units known as Jamo. These subcharacters not only determine the visual structure of Korean but also encode frequent and linguistically meaningful morphophonological processes. However, most current Korean language models (LMs) are based on subword tokenization schemes, which are not explicitly designed to capture the internal compositional structure of characters. To address this limitation, we propose SCRIPT, a model-agnostic module that injects subcharacter compositional knowledge into Korean PLMs. SCRIPT enhances subword embeddings with structural granularity, without requiring architectural changes or additional pre-training. As a result, SCRIPT improves all baselines across various Korean natural language understanding (NLU) and generation (NLG) tasks. Moreover, beyond performance gains, detailed linguistic analyses show that SCRIPT reshapes the embedding space in a way that better captures grammatical regularities and semantically cohesive variations. Our code is available at https://github.com/SungHo3268/SCRIPT.
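
To make the notion of subcharacter composition concrete, the following minimal sketch splits precomposed Hangul syllables into their Jamo using the standard Unicode code-point arithmetic. It illustrates only the compositional structure the abstract refers to; it is not taken from the SCRIPT repository, and the function names `decompose_syllable` and `decompose_text` are hypothetical.

```python
from typing import List

# Unicode constants for the precomposed Hangul Syllables block (U+AC00..U+D7A3)
# and the conjoining Jamo block it decomposes into.
S_BASE = 0xAC00
L_BASE, V_BASE, T_BASE = 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28  # T_COUNT includes "no final consonant"
S_COUNT = L_COUNT * V_COUNT * T_COUNT   # 11172 syllables


def decompose_syllable(ch: str) -> List[str]:
    """Split one precomposed Hangul syllable into its conjoining Jamo.

    Characters outside the Hangul Syllables block are returned unchanged.
    (Illustrative helper, not part of the SCRIPT codebase.)
    """
    s_index = ord(ch) - S_BASE
    if not 0 <= s_index < S_COUNT:
        return [ch]
    l_index = s_index // (V_COUNT * T_COUNT)              # initial consonant
    v_index = (s_index % (V_COUNT * T_COUNT)) // T_COUNT  # medial vowel
    t_index = s_index % T_COUNT                           # final consonant, 0 = none
    jamo = [chr(L_BASE + l_index), chr(V_BASE + v_index)]
    if t_index:
        jamo.append(chr(T_BASE + t_index))
    return jamo


def decompose_text(text: str) -> List[List[str]]:
    """Decompose every character of a string, keeping one Jamo list per character."""
    return [decompose_syllable(ch) for ch in text]
```

For example, `decompose_syllable("한")` yields an initial consonant, a medial vowel, and a final consonant, while a syllable without a final consonant such as "가" yields only two Jamo. A subword token covering several syllables therefore maps to a sequence of Jamo groups, which is the kind of structure a subword vocabulary does not expose by itself.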