REALM: Reliable Expertise-Aware Language Model Fine-Tuning from Noisy Annotations

arXiv cs.LG / April 21, 2026

📰 News · Models & Research

Key Points

  • The paper introduces REALM, a fine-tuning approach that learns each crowdworker’s expertise without supervision, instead of treating all annotations as equal via majority vote or averaging.
  • REALM models each annotator’s observed labels as a mixture of the model’s own prediction and uniform random guessing, with weights determined by a learned scalar expertise per annotator.
  • It further extends REALM to multi-task fine-tuning using a learned expertise matrix to capture annotator reliability differences across tasks.
  • Experiments fine-tuning three Flan-T5 sizes on five QA benchmarks under simulated noisy annotations show consistent improvements over naive noisy supervised fine-tuning, including accuracy gains of up to ~50% in the most adversarial settings.
  • The benefits are reported to increase with model capacity and to hold across datasets, model sizes, and multiple noise types, suggesting robustness for real-world noisy annotation pipelines.
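The mixture idea in the bullets above can be sketched numerically. This is a hypothetical illustration, not the paper's implementation: names like `expertise` and `num_classes` are assumptions, and real training would learn the expertise values jointly with the model parameters.

```python
import math

def annotation_log_likelihood(model_probs, observed_label, expertise, num_classes):
    """Log-likelihood of one observed label under the mixture
    P(label) = e * p_model(label) + (1 - e) * 1/K,
    where e in [0, 1] is the annotator's (learned) expertise."""
    p = expertise * model_probs[observed_label] + (1.0 - expertise) / num_classes
    return math.log(p)

# Example: a 4-way QA item where the model puts 0.7 on answer index 2.
model_probs = [0.1, 0.1, 0.7, 0.1]

# A high-expertise annotator's label is explained mostly by the model's
# prediction; a low-expertise annotator's label is mostly random guessing.
ll_expert = annotation_log_likelihood(model_probs, 2, expertise=0.9, num_classes=4)
ll_novice = annotation_log_likelihood(model_probs, 2, expertise=0.1, num_classes=4)
```

Maximizing this likelihood over both the model and the expertise scalars lets reliable annotators pull the model's predictions harder than unreliable ones, which is the intuition behind down-weighting noisy labels without any gold supervision.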

Abstract

Supervised fine-tuning of large language models relies on human-annotated data, yet annotation pipelines routinely involve multiple crowdworkers of heterogeneous expertise. Standard practice aggregates labels via majority vote or simple averaging, discarding annotator identity and causing the model to absorb the errors of unreliable annotators directly into its parameters. We propose REALM, a method that jointly learns the model parameters and a scalar expertise value for each annotator in a fully unsupervised manner, requiring no signal beyond annotator identity. The key idea is to model each observed label as a mixture between the model's prediction and a uniform random guess, weighted by the annotator's learned expertise. We extend REALM to a multi-task setting via a learned expertise matrix that captures per-annotator reliability across tasks. We evaluate on five question answering benchmarks, fine-tuning three sizes of Flan-T5 under simulated noisy annotations. The proposed algorithm consistently outperforms naive noisy SFT in the large majority of single- and multi-task settings, across datasets, model sizes, and noise types, with accuracy improvements of up to 50% in the most adversarial regime and gains that grow with model capacity.