Tokenization vs. Augmentation: A Systematic Study of Writer Variance in IMU-Based Online Handwriting Recognition
arXiv cs.LG / 3/19/2026
Key Points
- The study systematically compares two strategies—sub-word tokenization (Bigram) and concatenation-based data augmentation—in IMU-based online handwriting recognition to address inter-writer and intra-writer variability.
- On the writer-independent split, Bigram tokenization improves generalization to unseen writing styles, lowering the word error rate (WER) from 15.40% to 12.99%.
- On the writer-dependent split, tokenization degrades performance due to vocabulary distribution shifts, while concatenation-based data augmentation acts as a strong regularizer, reducing the character error rate (CER) by 34.5% and the WER by 25.4%.
- Short, low-level tokens benefit model performance, and concatenation-based augmentation can outperform training on a proportionally extended dataset.
- The results reveal a variance-dependent effect: tokenization mitigates inter-writer variability, whereas concatenation-based augmentation addresses intra-writer distribution sparsity, guiding technique choice by data distribution.
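The two strategies compared above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `bigram_tokenize` assumes the "Bigram" vocabulary is built from non-overlapping character pairs, and `concat_augment` assumes IMU samples are `(time, channels)` NumPy arrays paired with transcription strings; both function names and the exact splitting/joining rules are hypothetical.

```python
import numpy as np

def bigram_tokenize(word):
    # Split a word into non-overlapping character bigrams
    # (a sub-word vocabulary of short, low-level tokens);
    # a trailing odd character becomes a unigram token.
    return [word[i:i + 2] for i in range(0, len(word), 2)]

def concat_augment(sample_a, sample_b):
    # Concatenation-based data augmentation: join two IMU
    # recordings along the time axis and their transcriptions
    # with a space, yielding a synthetic multi-word sample
    # that densifies the intra-writer training distribution.
    signal = np.concatenate([sample_a[0], sample_b[0]], axis=0)
    label = sample_a[1] + " " + sample_b[1]
    return signal, label

# Example usage with dummy 6-channel IMU data:
a = (np.zeros((50, 6)), "hand")
b = (np.ones((30, 6)), "writing")
sig, lab = concat_augment(a, b)
print(bigram_tokenize("hello"))  # ['he', 'll', 'o']
print(sig.shape, lab)            # (80, 6) hand writing
```

In practice the augmented samples would be mixed into training alongside the originals; which of the two techniques helps depends, per the study, on whether the dominant variance is inter- or intra-writer.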