From Handwriting to Structured Data: Benchmarking AI Digitisation of Handwritten Forms

arXiv cs.CV / 4/21/2026


Key Points

  • The study benchmarks 17 leading multimodal large language models, including frontier and open-source options, on a difficult real-world medical handwritten form requiring extraction of mixed dates, printed text, and free responses.
  • Most smaller or older models underperform, while the latest Google and OpenAI models achieve around 85% accuracy and roughly 90% weighted F1 despite the high variability and noise in handwriting.
  • Model-specific strengths emerge: GPT 5.4 is strongest on noisy date extraction and has the lowest hallucination rate (6%); Claude Sonnet 4.6 performs best on formatted fields such as dates and numerical values; and Gemini 3.1 delivers the best overall results, with the lowest free-text error rates (WER 0.50, CER 0.31) and the strongest discrete classification metrics.
  • Prompt optimization yields large gains in macro precision, recall, and F1 (over 60%) but only modest improvements in weighted metrics (about 2–5%), suggesting the gains land mostly on rare field values, which carry little weight in support-weighted averages.
  • The results suggest multimodal LLM progress could enable highly automated digitization of complex handwritten workflows, which may be especially valuable for low- and middle-income countries where manual digitization is costly.
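The macro/weighted gap above follows directly from how the two averages are defined: macro F1 averages per-class scores equally, while weighted F1 scales each class by its support. A minimal sketch (using a hypothetical imbalanced checkbox field, not data from the paper) shows how fixing a rare class can move macro F1 dramatically while weighted F1 barely shifts:

```python
from collections import Counter

def per_class_f1(y_true, y_pred, label):
    """F1 for one class, computed from true/false positives and false negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def macro_and_weighted_f1(y_true, y_pred):
    """Macro F1 averages classes equally; weighted F1 scales by class support."""
    labels = sorted(set(y_true))
    support = Counter(y_true)
    scores = {lab: per_class_f1(y_true, y_pred, lab) for lab in labels}
    macro = sum(scores.values()) / len(labels)
    weighted = sum(scores[lab] * support[lab] for lab in labels) / len(y_true)
    return macro, weighted

# Hypothetical imbalanced field: 9 "yes" answers, 1 "no".
y_true = ["yes"] * 9 + ["no"]
baseline = ["yes"] * 10          # misses the rare class entirely
print(macro_and_weighted_f1(y_true, baseline))  # macro ≈ 0.47, weighted ≈ 0.85
print(macro_and_weighted_f1(y_true, y_true))    # both 1.0 once the rare class is fixed
```

The baseline already scores well on the weighted average because the majority class dominates it, so prompt changes that rescue rare classes show up mainly in the macro numbers.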

Abstract

Manual digitisation of structured handwritten documents is slow and costly. We benchmark 17 leading frontier and open-source multimodal large language models against a very challenging real-world medical form that mixes dates, structured printed text, and handwritten free responses with significant variability. None of the smaller or older models perform well, but the latest Google and OpenAI models reach accuracies around 85% with weighted F1 scores ≈ 90% across the discrete or predefined fields despite the very challenging nature of the responses. Clear task-specific strengths emerge: GPT 5.4 excels at noisy date extraction and is the most reliable, with the lowest hallucination rate (6%). Claude Sonnet 4.6 has the best average performance across formatted fields (dates and numerical values), while Gemini 3.1 delivers the best overall performance, with the lowest free-text error rates (WER = 0.50 and CER = 0.31) and the strongest results across discrete classification metrics. We further show that prompt optimisation dramatically improves macro precision, recall, and F1 by over 60%, but has little impact on weighted metrics (only ~2–5% improvement). These results provide evidence that the rapid improvement of multimodal large language models offers a compelling pathway toward fully automated digitisation of complex handwritten workflows, which is particularly relevant in low- and middle-income countries.
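The WER and CER figures quoted above are standard edit-distance error rates: the Levenshtein distance between the model transcription and the reference, normalised by reference length, computed over words (WER) and characters (CER). A minimal sketch of the metric definitions (not the paper's actual evaluation code):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, single-row DP."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            # deletion, insertion, or (possibly free) substitution
            dp[j] = min(dp[j] + 1, dp[j - 1] + 1, prev + (ref[i - 1] != hyp[j - 1]))
            prev = cur
    return dp[n]

def wer(reference, hypothesis):
    """Word error rate; reference assumed non-empty."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate; reference assumed non-empty."""
    return edit_distance(reference, hypothesis) / len(reference)

print(wer("patient reports mild pain", "patient report mild pain"))  # 0.25
print(cer("2024", "2O24"))  # 0.25 (one character misread)
```

A WER of 0.50 thus means roughly one word-level error for every two reference words, which frames how hard the free-text handwriting in this benchmark still is even for the best model.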