From Handwriting to Structured Data: Benchmarking AI Digitisation of Handwritten Forms

arXiv cs.CV / 4/21/2026


Key Points

  • The study benchmarks 17 leading multimodal large language models, including frontier and open-source options, on a difficult real-world medical handwritten form requiring extraction of mixed dates, printed text, and free responses.
  • Most smaller or older models underperform, while the latest Google and OpenAI models achieve around 85% accuracy and roughly 90% weighted F1 despite the high variability and noise in handwriting.
  • Model-specific strengths emerge: GPT 5.4 is strongest on noisy date extraction and has the lowest hallucination rate (6%); Claude Sonnet 4.6 performs best on formatted fields such as dates and numerical values; and Gemini 3.1 delivers the best overall results, with the lowest free-text error rates (WER 0.50, CER 0.31) and the strongest discrete classification metrics.
  • Prompt optimization yields large gains in macro precision, recall, and F1 (over 60%) but only modest improvements in weighted metrics (about 2–5%), suggesting the gains land mostly on rare field values, which carry little weight in support-weighted averages.
  • The results suggest multimodal LLM progress could enable highly automated digitization of complex handwritten workflows, which may be especially valuable for low- and middle-income countries where manual digitization is costly.
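The macro/weighted gap above follows directly from how the two averages are defined: macro F1 averages per-class scores equally, while weighted F1 scales each class by its support. A minimal sketch (using a hypothetical imbalanced checkbox field, not data from the paper) shows how fixing a rare class can move macro F1 dramatically while weighted F1 barely shifts:

```python
from collections import Counter

def per_class_f1(y_true, y_pred, label):
    """F1 for one class, computed from true/false positives and false negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def macro_and_weighted_f1(y_true, y_pred):
    """Macro F1 averages classes equally; weighted F1 scales by class support."""
    labels = sorted(set(y_true))
    support = Counter(y_true)
    scores = {lab: per_class_f1(y_true, y_pred, lab) for lab in labels}
    macro = sum(scores.values()) / len(labels)
    weighted = sum(scores[lab] * support[lab] for lab in labels) / len(y_true)
    return macro, weighted

# Hypothetical imbalanced field: 9 "yes" answers, 1 "no".
y_true = ["yes"] * 9 + ["no"]
baseline = ["yes"] * 10          # misses the rare class entirely
print(macro_and_weighted_f1(y_true, baseline))  # macro ≈ 0.47, weighted ≈ 0.85
print(macro_and_weighted_f1(y_true, y_true))    # both 1.0 once the rare class is fixed
```

The baseline already scores well on the weighted average because the majority class dominates it, so prompt changes that rescue rare classes show up mainly in the macro numbers.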

Abstract

Manual digitisation of structured handwritten documents is slow and costly. We benchmark 17 leading frontier and open-source multimodal large language models against a very challenging real-world medical form that mixes dates, structured printed text, and handwritten free responses with significant variability. None of the smaller or older models perform well, but the latest Google and OpenAI models reach accuracies around 85% with weighted F1 scores ≈ 90% across the discrete or predefined fields despite the very challenging nature of the responses. Clear task-specific strengths emerge: GPT 5.4 excels at noisy date extraction and is the most reliable, with the lowest hallucination rate (6%). Claude Sonnet 4.6 has the best average performance across formatted fields (dates and numerical values), while Gemini 3.1 delivers the best overall performance, with the lowest free-text error rates (WER = 0.50 and CER = 0.31) and the strongest results across discrete classification metrics. We further show that prompt optimisation dramatically improves macro precision, recall, and F1 by over 60%, but has little impact on weighted metrics (only ~2–5% improvement). These results provide evidence that the rapid improvement of multimodal large language models offers a compelling pathway toward fully automated digitisation of complex handwritten workflows, which is particularly relevant in low- and middle-income countries.
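The WER and CER figures quoted above are standard edit-distance error rates: the Levenshtein distance between the model transcription and the reference, normalised by reference length, computed over words (WER) and characters (CER). A minimal sketch of the metric definitions (not the paper's actual evaluation code):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, single-row DP."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            # deletion, insertion, or (possibly free) substitution
            dp[j] = min(dp[j] + 1, dp[j - 1] + 1, prev + (ref[i - 1] != hyp[j - 1]))
            prev = cur
    return dp[n]

def wer(reference, hypothesis):
    """Word error rate; reference assumed non-empty."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate; reference assumed non-empty."""
    return edit_distance(reference, hypothesis) / len(reference)

print(wer("patient reports mild pain", "patient report mild pain"))  # 0.25
print(cer("2024", "2O24"))  # 0.25 (one character misread)
```

A WER of 0.50 thus means roughly one word-level error for every two reference words, which frames how hard the free-text handwriting in this benchmark still is even for the best model.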