Democratizing the medieval English legal tradition

arXiv cs.CV / 5/5/2026

Key Points

The project digitizes early Anglo-American legal records written in highly abbreviated medieval Latin, building a dataset of 4029 lines of text across 193 medieval criminal and civil cases.
It trains an open-source, end-to-end transcription pipeline that uses R-Blla for line segmentation and a CNN+LSTM with CTC decoding for handwriting recognition, reaching 79% word accuracy.
Post-processing improves performance: adding an n-gram language model raises word accuracy to 82%, and using Gemini Pro 3 for error correction increases it to 88%.
A comparison between CNN+LSTM and TrOCR shows similar word accuracy, but TrOCR has worse character accuracy because it “guesses” more, which can make human verification harder.
The resulting system is deployed via a public web portal (glyphmachina.com) to broaden access for legal scholars, medievalists, and students.
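
The gap between word accuracy and character accuracy noted above can be illustrated with a minimal sketch. It assumes both metrics are computed as 1 minus the edit distance divided by the reference length, over whitespace-separated words and over characters respectively; this summary does not state the paper's exact definitions, and the Latin strings are invented for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # delete r from ref
                            curr[j - 1] + 1,       # insert h into ref
                            prev[j - 1] + cost))   # substitute or match
        prev = curr
    return prev[-1]


def accuracy(ref, hyp):
    """Assumed metric: 1 - edit_distance / len(ref), floored at 0."""
    if len(ref) == 0:
        return 1.0 if len(hyp) == 0 else 0.0
    return max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))


def char_accuracy(ref_text, hyp_text):
    return accuracy(ref_text, hyp_text)


def word_accuracy(ref_text, hyp_text):
    return accuracy(ref_text.split(), hyp_text.split())


# Hypothetical reference line and two hypothetical model outputs.
reference = "et predictus johannes venit"
near_miss = "et predictvs johannes venit"   # one wrong character
guess = "et prefatus johannes venit"        # a plausible but different word

for name, hyp in [("near miss", near_miss), ("guess", guess)]:
    print(name, word_accuracy(reference, hyp), char_accuracy(reference, hyp))
```

Both outputs get one word out of four wrong, so their word accuracy is identical, but the guessed word sits further from the reference at the character level. Under these assumptions, that is the sense in which a model that guesses more can match on word accuracy while leaving a human verifier less to work with.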

Abstract

The record of the beginning of the most widespread legal system in the world is contained in millions of pages of handwritten text. Most of the records of the first centuries of the Anglo-American legal system are handwritten in a highly abbreviated form of medieval Latin that only a few dozen scholars in the world are trained to read. In this interdisciplinary project, we construct a dataset of 4029 lines of text across 193 medieval criminal and civil cases. We then use the dataset to train an open-source end-to-end pipeline for transcribing these manuscripts. We first train standard neural network architectures for line segmentation and handwriting recognition (R-Blla and CNN+LSTM with CTC decoding, respectively) and show that they can already achieve 79% word accuracy, despite the relatively small training set and the challenge of expanding abbreviations. We then demonstrate that simple post-processing significantly boosts accuracy: adding an n-gram language model to the CTC decoder improves word accuracy to 82%, while asking Gemini Pro 3 to correct mistakes boosts accuracy to 88%. Finally, we compare the CNN+LSTM architecture with TrOCR, a transformer-based OCR architecture, demonstrating that TrOCR shows comparable word accuracy but worse character accuracy due to its over-willingness to guess, making it harder for humans to infer the correct reading. We incorporated our pipeline into a web portal (glyphmachina.com), opening up the English legal tradition to legal scholars, medievalists, and students.
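
For readers unfamiliar with CTC decoding, the simplest decoding rule can be sketched as follows: take the best-scoring symbol in each frame, merge consecutive repeats, then drop the blank symbol. This is a generic illustration, not the authors' implementation; the blank index, the toy alphabet, and the frame scores are all assumptions. The abstract's n-gram variant would instead weigh several candidate sequences by a language model rather than committing to the per-frame best symbol.

```python
BLANK = 0  # assumed index of the CTC blank symbol


def ctc_greedy_decode(frame_scores, alphabet):
    """Greedy (best-path) CTC decoding.

    frame_scores: one list of per-symbol scores for each time step.
    alphabet: maps a symbol index to its character; index BLANK is the blank.
    """
    # Pick the highest-scoring symbol index at every time step.
    best_path = [max(range(len(scores)), key=scores.__getitem__)
                 for scores in frame_scores]

    decoded = []
    previous = BLANK
    for index in best_path:
        # Emit a symbol only when it differs from the previous frame
        # and is not the blank; a blank between repeats keeps both.
        if index != previous and index != BLANK:
            decoded.append(alphabet[index])
        previous = index
    return "".join(decoded)


# Toy alphabet and scores, invented for illustration.
alphabet = ["<blank>", "e", "t"]
frames = [
    [0.1, 0.8, 0.1],  # e
    [0.2, 0.7, 0.1],  # e (repeat, merged)
    [0.9, 0.05, 0.05],  # blank
    [0.1, 0.1, 0.8],  # t
    [0.1, 0.2, 0.7],  # t (repeat, merged)
]
print(ctc_greedy_decode(frames, alphabet))
```

A blank between two identical symbols is what allows a doubled letter to survive the merge step, which is why the blank is removed only after repeats are collapsed.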