Polish phonology and morphology through the lens of distributional semantics

arXiv cs.CL / 4/3/2026


Key Points

  • The paper uses distributional semantics to test whether Polish phonological and morphonological structure is mirrored in semantic embedding space, focusing on consonant clusters and complex word forms.
  • Experiments with techniques such as t-SNE, Linear Discriminant Analysis, and Linear Discriminative Learning show that embeddings encode not only morphosyntactic features (case, gender, number, tense, aspect) but also sub-lexical information such as phoneme-string patterns.
  • The study reports that phonotactic complexity, morphotactic transparency, and available morphosyntactic categories can be predicted from embeddings without explicitly using the surface forms.
  • It argues that a discriminative lexicon model built on embeddings can support highly accurate predictions for comprehension and production, due to strong structural correspondences between semantic and form spaces.
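The second key point can be illustrated concretely. The sketch below shows how a morphosyntactic category (here, a hypothetical nominative-vs-genitive "case" label) could be predicted from embeddings alone with Linear Discriminant Analysis, using a pooled-covariance Gaussian discriminant. The data are random stand-ins, not the paper's Polish embeddings; dimensions, class structure, and separation strength are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for word embeddings: two "case" classes separated
# along a latent direction plus Gaussian noise. Real experiments would
# use embeddings of Polish word forms labelled for case.
dim, n_per_class = 10, 50
direction = rng.normal(size=dim)
nom = rng.normal(size=(n_per_class, dim)) + 1.5 * direction
gen = rng.normal(size=(n_per_class, dim)) - 1.5 * direction
X = np.vstack([nom, gen])
y = np.array([0] * n_per_class + [1] * n_per_class)

# LDA with a shared (pooled) covariance estimate: assign each vector to
# the class whose linear discriminant score is highest (equal priors).
means = np.array([X[y == k].mean(axis=0) for k in (0, 1)])
pooled = sum(np.cov(X[y == k].T) for k in (0, 1)) / 2
inv = np.linalg.pinv(pooled)

def lda_predict(x):
    # discriminant: x^T S^-1 mu_k - 0.5 * mu_k^T S^-1 mu_k
    scores = [x @ inv @ m - 0.5 * m @ inv @ m for m in means]
    return int(np.argmax(scores))

preds = np.array([lda_predict(x) for x in X])
accuracy = (preds == y).mean()
print(f"training accuracy: {accuracy:.2f}")
```

Note that the classifier sees only the embedding vectors, never the word forms; high accuracy on real data would indicate, as the paper argues, that case information is recoverable from semantic space alone.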

Abstract

This study investigates the relationship between the phonological and morphological structure of Polish words and their meanings using Distributional Semantics. In the present analysis, we ask whether there is a relationship between the form properties of words containing consonant clusters and their meanings. Is the phonological and morphonological structure of complex words mirrored in semantic space? We address these questions for Polish, a language characterized by non-trivial morphology and an impressive inventory of morphologically-motivated consonant clusters. We use statistical and computational techniques, such as t-SNE, Linear Discriminant Analysis and Linear Discriminative Learning, and demonstrate that -- apart from encoding rich morphosyntactic information (e.g. tense, number, case) -- semantic vectors capture information on sub-lexical linguistic units such as phoneme strings. First, phonotactic complexity, morphotactic transparency, and a wide range of morphosyntactic categories available in Polish (case, gender, aspect, tense, number) can be predicted from embeddings without requiring any information about the forms of words. Second, we argue that computational modelling with the discriminative lexicon model using embeddings can provide highly accurate predictions for comprehension and production, exactly because of the existence of extensive information in semantic space that is to a considerable extent isomorphic with structure in the form space.
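The discriminative lexicon model mentioned in the abstract learns linear mappings between a form space and a semantic space. The following is a minimal sketch of that idea: a comprehension mapping F with C F ≈ S and a production mapping G with S G ≈ C, both estimated by least squares. The form cues, embedding dimensions, and evaluation-by-nearest-correlation are illustrative assumptions on random toy data, not the paper's actual matrices or results.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy form matrix C (rows = words, columns = binary phoneme-cue indicators)
# and semantic matrix S (rows = words, columns = embedding dimensions).
# Both are random stand-ins for illustration only.
n_words, n_cues, n_dims = 40, 60, 25
C = (rng.random((n_words, n_cues)) < 0.2).astype(float)
S = rng.normal(size=(n_words, n_dims))

# Comprehension: linear mapping F such that C @ F approximates S.
F, *_ = np.linalg.lstsq(C, S, rcond=None)
# Production: linear mapping G such that S @ G approximates C.
G, *_ = np.linalg.lstsq(S, C, rcond=None)

S_hat = C @ F  # predicted semantic vectors from form alone

def corr(a, b):
    # Pearson correlation between two vectors.
    a = a - a.mean()
    b = b - b.mean()
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

# A word is "comprehended" correctly if its predicted semantic vector
# correlates most strongly with its own gold vector.
hits = sum(
    int(np.argmax([corr(S_hat[i], S[j]) for j in range(n_words)]) == i)
    for i in range(n_words)
)
print(f"comprehension accuracy: {hits}/{n_words}")
```

The abstract's claim is that such linear mappings work well precisely because structure in semantic space is largely isomorphic with structure in form space; on unstructured random data like this toy example, accuracy reflects only the algebra of the least-squares fit, not any linguistic regularity.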