Tokenization vs. Augmentation: A Systematic Study of Writer Variance in IMU-Based Online Handwriting Recognition
arXiv cs.LG / 3/19/2026
Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The study systematically compares two strategies for handling inter-writer and intra-writer variability in IMU-based online handwriting recognition: sub-word (bigram) tokenization and concatenation-based data augmentation (both are sketched in code after this list).
- On the writer-independent split, bigram tokenization improves generalization to unseen writing styles, lowering the word error rate (WER) from 15.40% to 12.99%.
- On the writer-dependent split, tokenization degrades performance because of vocabulary distribution shifts, while concatenation-based data augmentation acts as a strong regularizer, reducing the character error rate (CER) by 34.5% and the WER by 25.4%.
- Short, low-level tokens benefit model performance, and concatenation-based augmentation can outperform a proportionally extended training schedule on the unaugmented data.
- The results reveal a variance-dependent effect: tokenization mitigates inter-writer variability, whereas concatenation-based augmentation addresses intra-writer distribution sparsity, so the choice of technique should follow the data distribution.
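To make the bigram tokenization concrete, here is a minimal Python sketch of one plausible scheme: build a vocabulary of frequent character bigrams from the training transcripts, then tokenize label strings greedily. The paper's exact tokenizer is not reproduced here; the function names and the greedy fallback to single characters are illustrative assumptions.

```python
# Minimal sketch of character-bigram tokenization for recognition labels.
# Assumption: a greedy left-to-right scheme over the most frequent bigrams;
# the paper's actual sub-word tokenizer may differ.

from collections import Counter

def build_bigram_vocab(transcripts, max_bigrams=500):
    """Collect the most frequent character bigrams from training labels."""
    counts = Counter()
    for text in transcripts:
        counts.update(text[i:i + 2] for i in range(len(text) - 1))
    return {bg for bg, _ in counts.most_common(max_bigrams)}

def tokenize(text, bigram_vocab):
    """Greedy left-to-right pass: emit a known bigram, else a single char."""
    tokens, i = [], 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in bigram_vocab:
            tokens.append(pair)
            i += 2
        else:
            tokens.append(text[i])
            i += 1
    return tokens

vocab = build_bigram_vocab(["handwriting", "writing", "writer"])
print(tokenize("handwriter", vocab))  # ['ha', 'nd', 'wr', 'it', 'er']
```

Shorter units like these keep the output vocabulary small and well covered across writers, which is consistent with the finding that short, low-level tokens help generalization.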
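Similarly, here is a minimal sketch of concatenation-based augmentation, assuming each training sample is a (T, C) IMU array (T time steps, C sensor channels) paired with a word label. The random pairing, the seed, and the space-joined labels are illustrative assumptions, not the authors' pipeline.

```python
# Minimal sketch of concatenation-based data augmentation for IMU sequences.
# Assumption: new samples are formed by stitching two recordings in time and
# joining their word labels; the paper's exact procedure may differ.

import random
import numpy as np

def concat_augment(samples, num_new, seed=0):
    """Create new samples by joining two random samples along the time axis."""
    rng = random.Random(seed)
    augmented = []
    for _ in range(num_new):
        (x1, y1), (x2, y2) = rng.sample(samples, 2)
        x = np.concatenate([x1, x2], axis=0)  # stitch the sensor streams in time
        y = y1 + " " + y2                     # join the word labels with a space
        augmented.append((x, y))
    return augmented

# Usage with dummy data: three fake IMU recordings, shape (time steps, channels).
data = [(np.zeros((120, 6)), "hello"),
        (np.zeros((90, 6)), "world"),
        (np.zeros((100, 6)), "imu")]
extra = concat_augment(data, num_new=2)
print(extra[0][0].shape, extra[0][1])  # e.g. (210, 6) 'hello world'
```

Because the synthesized sequences recombine motion patterns a writer already produced, this densifies the intra-writer distribution rather than introducing new styles, matching its regularizing effect on the writer-dependent split.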