Curation of a Palaeohispanic Dataset for Machine Learning

arXiv cs.AI / 4/16/2026

Opinion | Ideas & Deep Analysis | Models & Research

Key Points

  • The article proposes building a structured machine-learning-ready dataset to support research on Palaeohispanic languages of the Iberian Peninsula before Roman arrival.
  • It notes that computational approaches are held back by scarce resources and that existing materials are often in formats unsuitable for ML techniques.
  • It frames the dataset as enabling computational and data-driven linguistic analysis despite the fact that none of the Palaeohispanic languages is fully deciphered.
  • The work positions a more practical, curated data format as a foundation for future progress in the field.

Abstract

Palaeohispanic languages are those spoken in the Iberian Peninsula before the arrival of the Romans in the 3rd century B.C. Their study was set in motion when Gómez Moreno deciphered the Iberian Levantine script, one of several semi-syllabaries used to write these languages. Still, the Palaeohispanic languages have been deciphered to varying degrees, and none is fully understood to this day. Most studies have taken a purely linguistic point of view, and a computational approach could benefit this research area greatly. However, the available resources are limited and presented in formats unsuitable for techniques such as Machine Learning. Therefore, a structured dataset is constructed, with the aim of enabling further progress in the field.
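
The abstract does not describe the dataset's schema, so the following is only a minimal sketch of what a "structured", ML-ready record could look like. Every name here (`InscriptionRecord`, its fields, `to_jsonl`, `from_jsonl`) is hypothetical and not taken from the paper, and the sample values are placeholders rather than a real inscription.

```python
import json
from dataclasses import dataclass, asdict, field
from typing import List, Optional


@dataclass
class InscriptionRecord:
    """One inscription as a flat record (hypothetical schema, not the paper's)."""
    record_id: str                     # assumed unique identifier
    language: str                      # assumed label, e.g. "Iberian"
    script: str                        # assumed label for the writing system
    signs: List[str] = field(default_factory=list)  # transliterated signs, in reading order
    find_site: Optional[str] = None    # None when the provenance is unknown


def to_jsonl(records: List[InscriptionRecord]) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(asdict(r), ensure_ascii=False) for r in records)


def from_jsonl(text: str) -> List[InscriptionRecord]:
    """Parse JSON Lines back into records, skipping blank lines."""
    return [
        InscriptionRecord(**json.loads(line))
        for line in text.splitlines()
        if line.strip()
    ]


# Placeholder values only; these signs do not come from any real inscription.
demo = [InscriptionRecord("demo-001", "Iberian", "Levantine", ["s1", "s2", "s3"])]
restored = from_jsonl(to_jsonl(demo))
```

Storing one record per line keeps the sign sequence explicit and tokenized, which is the kind of format that standard ML tooling can read directly, in contrast to free-text editions of the inscriptions.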