From graphemic dependence to lexical structure: a Markovian perspective on Dante's Commedia

arXiv cs.CL / 4/27/2026


Key Points

  • The paper studies Dante’s Divina Commedia by encoding text into vowel/consonant (V/C) symbols and modeling the resulting sequence as a four-state Markov chain.
  • It introduces a “graphemic memory” index that slightly but consistently increases from Inferno to Paradiso, suggesting a directional shift in local dependency structure.
  • A trigram-level analysis attributes the trend to a limited set of recurrent configurations (“graphemic probes”) that connect Markov patterns to identifiable lexical environments.
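  • The probes behave differently depending on position: configurations with two V/C transitions appear more often across word boundaries, while those with fewer transitions stay largely within single words. Orthographic conventions, especially apostrophised forms, also shape part of the signal.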
  • A complementary classification approach uses cantica-specific terms as lexical anchors, showing that the three cantiche are separated but also follow a continuous trajectory across the poem.
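
The encoding and chain described in the key points can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The summary does not say how the four states are defined, so the sketch assumes they are the four overlapping V/C pairs (CC, CV, VC, VV). The vowel set and the names `encode_vc` and `pair_transition_matrix` are likewise hypothetical choices made for this example, and the paper's graphemic memory index is not reproduced here because its formula is not given.

```python
from collections import Counter

# Assumption: accented vowels count as vowels; every other letter is a consonant.
VOWELS = set("aeiouàèéìíòóùú")

# Assumption: the four Markov states are overlapping pairs of V/C symbols.
STATES = ("CC", "CV", "VC", "VV")


def encode_vc(text):
    """Map each letter to 'V' or 'C'; spaces, apostrophes and punctuation are dropped."""
    return "".join(
        "V" if ch in VOWELS else "C"
        for ch in text.lower()
        if ch.isalpha()
    )


def pair_transition_matrix(symbols):
    """Estimate P(next pair | current pair) from overlapping V/C pairs.

    Rows for states that never occur with a successor are all zeros.
    """
    counts = Counter()
    for i in range(len(symbols) - 2):
        current_pair = symbols[i:i + 2]
        next_pair = symbols[i + 1:i + 3]
        counts[(current_pair, next_pair)] += 1

    matrix = {}
    for s in STATES:
        total = sum(counts[(s, t)] for t in STATES)
        matrix[s] = {
            t: (counts[(s, t)] / total if total else 0.0) for t in STATES
        }
    return matrix


if __name__ == "__main__":
    line = "Nel mezzo del cammin di nostra vita"
    symbols = encode_vc(line)
    print(symbols)
    for state, row in pair_transition_matrix(symbols).items():
        print(state, row)
```

Because consecutive pairs overlap by one symbol, each row has at most two non-zero entries (for example, CV can only be followed by VC or VV). Any summary index of persistence versus alternation would be computed from these transition probabilities.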

Abstract

This study investigates the structural organisation of Dante's Divina Commedia through a symbolic representation based on vowel-consonant (V/C) encoding. Modelling the resulting sequence as a four-state Markov chain yields a parsimonious index of graphemic memory, capturing the balance between persistence and alternation patterns. Across the poem, this index exhibits a slight but consistent increase from the Inferno to the Paradiso, indicating a directional shift in local dependency structure. Trigram-level analysis shows that this trend is driven by a restricted set of recurrent configurations, interpreted as graphemic probes linking the Markov representation to identifiable lexical environments in the text. These probes display distinct behaviours: configurations involving two transitions more frequently emerge across word boundaries, reflecting interactions between adjacent tokens, whereas configurations with fewer transitions are largely confined to intra-lexical structures. Part of the signal is further shaped by orthographic phenomena, particularly apostrophised forms, highlighting the role of writing conventions alongside phonological and lexical organisation. A complementary classification analysis identifies cantica-specific terms, providing lexical anchors through which graphemic probes can be related to the structure of the poem. This organisation is reflected not only in the separation of the three cantiche, but also in a continuous trajectory across the text. Overall, the results show that simple probabilistic models applied to symbolic text representations can uncover structured interactions between local dependencies, lexical distribution, orthographic encoding, and large-scale organisation, providing an interpretable framework for linking local symbolic dynamics to higher-level textual organisation.
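
The trigram-level observation in the abstract, that configurations with two transitions tend to span word boundaries while those with fewer transitions stay inside words, can be explored with a sketch like the one below. It is an illustration under stated assumptions rather than the paper's procedure: words are split on whitespace, apostrophes are dropped without creating a boundary, and the vowel set and the function names (`letters_with_word_ids`, `n_transitions`, `trigram_boundary_counts`) are hypothetical.

```python
from collections import Counter

# Assumption: accented vowels count as vowels; every other letter is a consonant.
VOWELS = set("aeiouàèéìíòóùú")


def letters_with_word_ids(text):
    """Return (symbol, word_index) for every letter; whitespace separates words."""
    stream = []
    for word_index, word in enumerate(text.lower().split()):
        for ch in word:
            if ch.isalpha():  # apostrophes and punctuation are skipped
                stream.append(("V" if ch in VOWELS else "C", word_index))
    return stream


def n_transitions(trigram):
    """Count V/C changes inside a trigram (0, 1 or 2)."""
    return sum(a != b for a, b in zip(trigram, trigram[1:]))


def trigram_boundary_counts(text):
    """Count each V/C trigram, split by whether it spans a word boundary."""
    stream = letters_with_word_ids(text)
    counts = Counter()
    for i in range(len(stream) - 2):
        window = stream[i:i + 3]
        trigram = "".join(symbol for symbol, _ in window)
        crosses_boundary = window[0][1] != window[2][1]
        counts[(trigram, crosses_boundary)] += 1
    return counts


if __name__ == "__main__":
    counts = trigram_boundary_counts("Nel mezzo del cammin di nostra vita")
    for (trigram, crosses), n in sorted(counts.items()):
        print(trigram, n_transitions(trigram), "across" if crosses else "within", n)
```

Comparing the across-boundary and within-word counts for each trigram, grouped by `n_transitions`, gives a rough view of the contrast the abstract describes. Treating the apostrophe as a word boundary instead would change the counts, which is one way to see why apostrophised forms affect the signal.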