Tokenization vs. Augmentation: A Systematic Study of Writer Variance in IMU-Based Online Handwriting Recognition
arXiv cs.LG / 3/19/2026
Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The study systematically compares two strategies for handling inter-writer and intra-writer variability in IMU-based online handwriting recognition: sub-word (bigram) tokenization and concatenation-based data augmentation (both are sketched in code after this list).
- On the writer-independent split, bigram tokenization improves generalization to unseen writing styles, lowering the word error rate (WER) from 15.40% to 12.99%.
- On the writer-dependent split, tokenization degrades performance because of vocabulary distribution shifts, while concatenation-based data augmentation acts as a strong regularizer, reducing the character error rate (CER) by 34.5% and the WER by 25.4%.
- Short, low-level tokens benefit model performance, and concatenation-based augmentation can outperform a proportionally extended training schedule on the unaugmented data.
- The results reveal a variance-dependent effect: tokenization mitigates inter-writer variability, whereas concatenation-based augmentation addresses intra-writer distribution sparsity, so the choice of technique should follow the data distribution.
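To make the bigram tokenization concrete, here is a minimal Python sketch of one plausible scheme: build a vocabulary of frequent character bigrams from the training transcripts, then tokenize label strings greedily. The paper's exact tokenizer is not reproduced here; the function names and the greedy fallback to single characters are illustrative assumptions.

```python
# Minimal sketch of character-bigram tokenization for recognition labels.
# Assumption: a greedy left-to-right scheme over the most frequent bigrams;
# the paper's actual sub-word tokenizer may differ.

from collections import Counter

def build_bigram_vocab(transcripts, max_bigrams=500):
    """Collect the most frequent character bigrams from training labels."""
    counts = Counter()
    for text in transcripts:
        counts.update(text[i:i + 2] for i in range(len(text) - 1))
    return {bg for bg, _ in counts.most_common(max_bigrams)}

def tokenize(text, bigram_vocab):
    """Greedy left-to-right pass: emit a known bigram, else a single char."""
    tokens, i = [], 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in bigram_vocab:
            tokens.append(pair)
            i += 2
        else:
            tokens.append(text[i])
            i += 1
    return tokens

vocab = build_bigram_vocab(["handwriting", "writing", "writer"])
print(tokenize("handwriter", vocab))  # ['ha', 'nd', 'wr', 'it', 'er']
```

Shorter units like these keep the output vocabulary small and well covered across writers, which is consistent with the finding that short, low-level tokens help generalization.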
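Similarly, here is a minimal sketch of concatenation-based augmentation, assuming each training sample is a (T, C) IMU array (T time steps, C sensor channels) paired with a word label. The random pairing, the seed, and the space-joined labels are illustrative assumptions, not the authors' pipeline.

```python
# Minimal sketch of concatenation-based data augmentation for IMU sequences.
# Assumption: new samples are formed by stitching two recordings in time and
# joining their word labels; the paper's exact procedure may differ.

import random
import numpy as np

def concat_augment(samples, num_new, seed=0):
    """Create new samples by joining two random samples along the time axis."""
    rng = random.Random(seed)
    augmented = []
    for _ in range(num_new):
        (x1, y1), (x2, y2) = rng.sample(samples, 2)
        x = np.concatenate([x1, x2], axis=0)  # stitch the sensor streams in time
        y = y1 + " " + y2                     # join the word labels with a space
        augmented.append((x, y))
    return augmented

# Usage with dummy data: three fake IMU recordings, shape (time steps, channels).
data = [(np.zeros((120, 6)), "hello"),
        (np.zeros((90, 6)), "world"),
        (np.zeros((100, 6)), "imu")]
extra = concat_augment(data, num_new=2)
print(extra[0][0].shape, extra[0][1])  # e.g. (210, 6) 'hello world'
```

Because the synthesized sequences recombine motion patterns a writer already produced, this densifies the intra-writer distribution rather than introducing new styles, matching its regularizing effect on the writer-dependent split.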