Evaluating Interactive 2D Visualization as a Sample Selection Strategy for Biomedical Time-Series Data Annotation

arXiv cs.LG / 3/30/2026

Key Points

  • The study evaluates three sample selection strategies for biomedical time-series annotation—random sampling (RND), farthest-first traversal (FAFT), and a 2D-visualization user interface method (2DV)—using real human annotators under limited annotation budgets.
  • Across four classification tasks (infant motility assessment and speech emotion recognition), 2DV delivers the best overall results when labels are aggregated across annotators.
  • For infant motility assessment, 2DV is especially effective at capturing rare classes, but under a limited budget its greater annotator-to-annotator label distribution variability reduces model performance when training on individual annotators’ labels; in that setting, FAFT performs best.
  • In speech emotion recognition, 2DV outperforms the other methods among expert annotators and matches the other methods among non-experts when models are trained on individual annotators’ labels.
  • Risk analysis suggests RND is the safest option when annotator number or expertise is uncertain, while 2DV carries the highest failure risk due to higher variability; interviews also found 2DV makes annotation more engaging.
  • The authors conclude that 2DV-based sampling is promising for biomedical time-series labeling, particularly when annotation budgets are not extremely tight.

Abstract

Reliable machine-learning models in biomedical settings depend on accurate labels, yet annotating biomedical time-series data remains challenging. Algorithmic sample selection may support annotation, but evidence from studies involving real human annotators is scarce. Consequently, we compare three sample selection methods for annotation: random sampling (RND), farthest-first traversal (FAFT), and a graphical user interface-based method enabling exploration of complementary 2D visualizations (2DVs) of high-dimensional data. We evaluated the methods across four classification tasks in infant motility assessment (IMA) and speech emotion recognition (SER). Twelve annotators, categorized as experts or non-experts, performed data annotation under a limited annotation budget, and post-annotation experiments were conducted to evaluate the sampling methods. Across all classification tasks, 2DV performed best when aggregating labels across annotators. In IMA, 2DV most effectively captured rare classes, but also exhibited greater annotator-to-annotator label distribution variability resulting from the limited annotation budget, decreasing classification performance when models were trained on individual annotators' labels; in these cases, FAFT excelled. For SER, 2DV outperformed the other methods among expert annotators and matched their performance for non-experts in the individual-annotator setting. A failure risk analysis revealed that RND was the safest choice when annotator count or annotator expertise was uncertain, whereas 2DV had the highest risk due to its greater label distribution variability. Furthermore, post-experiment interviews indicated that 2DV made the annotation task more interesting and enjoyable. Overall, 2DV-based sampling appears promising for biomedical time-series data annotation, particularly when the annotation budget is not highly constrained.
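
For readers unfamiliar with farthest-first traversal, the following is a minimal sketch of the general greedy technique, not the authors' implementation. It assumes samples are already represented as fixed-length feature vectors and that Euclidean distance is used; the function name `farthest_first_traversal`, the `start` parameter, and the toy data are hypothetical and chosen only for illustration.

```python
import numpy as np


def farthest_first_traversal(X, budget, start=0):
    """Greedily pick `budget` row indices of X that are spread far apart.

    Assumptions (not taken from the paper): Euclidean distance on feature
    vectors, and a caller-chosen starting index `start`.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    if not 0 < budget <= n:
        raise ValueError("budget must be between 1 and the number of samples")

    selected = [start]
    # Distance from every sample to its nearest already-selected sample.
    min_dist = np.linalg.norm(X - X[start], axis=1)

    while len(selected) < budget:
        # The next pick is the sample farthest from everything chosen so far.
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        # Update nearest-selected distances with the newly chosen sample.
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[nxt], axis=1))

    return selected


# Toy example: five points on a line, annotation budget of three samples.
points = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [0.5, 0.0], [5.0, 0.0]])
print(farthest_first_traversal(points, budget=3))  # [0, 2, 4]
```

Each step selects the sample farthest from everything chosen so far, so the selection spreads across the feature space, whereas random sampling tends to concentrate on dense regions.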