Domain Fine-Tuning FinBERT on Finnish Histopathological Reports: Train-Time Signals and Downstream Correlations

arXiv cs.CL / April 17, 2026

💬 Opinion · Models & Research

Key Points

  • The paper investigates domain fine-tuning of the Finnish BERT model (FinBERT) on unlabeled Finnish medical text, targeting NLP classification settings with limited labeled data (a continued-pretraining sketch follows this list).
  • It documents observations from fine-tuning on Finnish histopathological reports and evaluates how this domain adaptation affects downstream performance.
  • The authors attempt to predict the benefit of domain-specific pre-training by analyzing how the geometry of the embedding space changes during domain fine-tuning (an embedding-drift probe is sketched after this list).
  • The work is motivated by healthcare AI scenarios where collecting new datasets—particularly labeled data—can take significant time.
  • Overall, it connects practical domain adaptation for medical NLP with a more analytical method for anticipating gains from domain pre-training.
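
The domain fine-tuning referred to above is continued pretraining with the masked-language-modeling (MLM) objective on unlabeled in-domain text. Below is a minimal sketch of that setup using Hugging Face Transformers, assuming the public TurkuNLP/bert-base-finnish-cased-v1 FinBERT checkpoint; the data file `reports.txt`, the output directory, and all hyperparameters are hypothetical placeholders, not the paper's actual recipe.

```python
# Hypothetical continued-pretraining sketch: MLM fine-tuning of FinBERT
# on unlabeled domain text. Paths and hyperparameters are illustrative.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "TurkuNLP/bert-base-finnish-cased-v1"  # public FinBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# One de-identified report per line in a plain-text file (hypothetical path).
raw = load_dataset("text", data_files={"train": "reports.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# Standard BERT pretraining objective: mask 15% of tokens and predict them.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="finbert-histo",
        num_train_epochs=3,
        per_device_train_batch_size=16,
    ),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
trainer.save_model("finbert-histo")  # domain-tuned weights, reused below
```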

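As for quantifying the geometry of embedding changes, the paper's actual measures are not reproduced here. As one illustrative probe, the snippet below mean-pools token embeddings from the base and the domain-tuned model over the same in-domain sentences and reports their cosine similarity, where lower values indicate larger representational drift. The tuned-model directory and the sentence list are placeholders.

```python
# Illustrative embedding-drift probe, not the paper's metric: compare
# mean-pooled sentence embeddings from the base and domain-tuned models.
import torch
from transformers import AutoModel, AutoTokenizer

base_name = "TurkuNLP/bert-base-finnish-cased-v1"
tuned_dir = "finbert-histo"  # output of the MLM sketch above (hypothetical)

tokenizer = AutoTokenizer.from_pretrained(base_name)
base = AutoModel.from_pretrained(base_name).eval()
tuned = AutoModel.from_pretrained(tuned_dir).eval()

@torch.no_grad()
def embed(model, sentences):
    enc = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**enc).last_hidden_state       # (batch, seq_len, dim)
    mask = enc["attention_mask"].unsqueeze(-1)    # zero out padding positions
    return (hidden * mask).sum(1) / mask.sum(1)   # mean pooling over tokens

# Replace with a sample of real (de-identified) report sentences.
sentences = ["placeholder report sentence"]

e_base, e_tuned = embed(base, sentences), embed(tuned, sentences)

# Because the tuned model was initialized from the base model, the two
# spaces are aligned; low cosine similarity means large geometric drift.
cos = torch.nn.functional.cosine_similarity(e_base, e_tuned, dim=-1)
print(f"mean cos(base, tuned) = {cos.mean().item():.3f}")
```
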
Abstract

In NLP classification tasks where little labeled data exists, domain fine-tuning of transformer models on unlabeled data is an established approach. In this paper we have two aims. (1) We describe our observations from fine-tuning the Finnish BERT model on Finnish medical text data. (2) We report on our attempts to predict the benefit of domain-specific pre-training of Finnish BERT from observing the geometry of embedding changes due to domain fine-tuning. Our driving motivation is the common situation in healthcare AI where we might experience long delays in acquiring datasets, especially with respect to labels.