REALM: Reliable Expertise-Aware Language Model Fine-Tuning from Noisy Annotations

arXiv cs.LG / April 21, 2026

📰 News · Models & Research

Key Points

  • The paper introduces REALM, a fine-tuning approach that learns each crowdworker’s expertise without supervision, instead of treating all annotations as equal via majority vote or averaging.
  • REALM models each annotator’s observed labels as a mixture of the model’s own prediction and uniform random guessing, with weights determined by a learned scalar expertise per annotator.
  • It further extends REALM to multi-task fine-tuning using a learned expertise matrix to capture annotator reliability differences across tasks.
  • Experiments fine-tuning three Flan-T5 sizes on five QA benchmarks under simulated noisy annotations show consistent improvements over naive noisy supervised fine-tuning, including accuracy gains of up to ~50% in the most adversarial settings.
  • The benefits are reported to increase with model capacity and to hold across datasets, model sizes, and multiple noise types, suggesting robustness for real-world noisy annotation pipelines.
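The mixture idea in the bullets above can be sketched numerically. This is a hypothetical illustration, not the paper's implementation: names like `expertise` and `num_classes` are assumptions, and real training would learn the expertise values jointly with the model parameters.

```python
import math

def annotation_log_likelihood(model_probs, observed_label, expertise, num_classes):
    """Log-likelihood of one observed label under the mixture
    P(label) = e * p_model(label) + (1 - e) * 1/K,
    where e in [0, 1] is the annotator's (learned) expertise."""
    p = expertise * model_probs[observed_label] + (1.0 - expertise) / num_classes
    return math.log(p)

# Example: a 4-way QA item where the model puts 0.7 on answer index 2.
model_probs = [0.1, 0.1, 0.7, 0.1]

# A high-expertise annotator's label is explained mostly by the model's
# prediction; a low-expertise annotator's label is mostly random guessing.
ll_expert = annotation_log_likelihood(model_probs, 2, expertise=0.9, num_classes=4)
ll_novice = annotation_log_likelihood(model_probs, 2, expertise=0.1, num_classes=4)
```

Maximizing this likelihood over both the model and the expertise scalars lets reliable annotators pull the model's predictions harder than unreliable ones, which is the intuition behind down-weighting noisy labels without any gold supervision.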

Abstract

Supervised fine-tuning of large language models relies on human-annotated data, yet annotation pipelines routinely involve multiple crowdworkers of heterogeneous expertise. Standard practice aggregates labels via majority vote or simple averaging, discarding annotator identity and causing the model to absorb the errors of unreliable annotators directly into its parameters. We propose REALM, a method that jointly learns the model parameters and a scalar expertise value for each annotator in a fully unsupervised manner, requiring no signal beyond annotator identity. The key idea is to model each observed label as a mixture between the model's prediction and a uniform random guess, weighted by the annotator's learned expertise. We extend REALM to a multi-task setting via a learned expertise matrix that captures per-annotator reliability across tasks. We evaluate on five question answering benchmarks, fine-tuning three sizes of Flan-T5 under simulated noisy annotations. The proposed algorithm consistently outperforms naive noisy SFT in the large majority of single- and multi-task settings, across datasets, model sizes, and noise types, with accuracy improvements of up to 50% in the most adversarial regime and gains that grow with model capacity.