IndoBERT-Relevancy: A Context-Conditioned Relevancy Classifier for Indonesian Text

arXiv cs.CL / March 30, 2026


Key Points

  • The paper introduces IndoBERT-Relevancy, a context-conditioned classifier designed to judge whether a candidate Indonesian text is relevant to a given topical context.
  • It is built on IndoBERT Large (335M parameters) and trained on a newly created dataset of 31,360 labeled (topic, text) pairs across 188 topics.
  • The authors use an iterative, failure-driven dataset construction approach and find that no single data source provides sufficient coverage for robust relevancy classification.
  • They add targeted synthetic data to address specific weaknesses, achieving an F1 score of 0.948 and 96.5% accuracy on both formal and informal Indonesian.
  • The resulting model is released publicly on HuggingFace for reuse in relevancy-filtering and related NLP pipelines.
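The iterative, failure-driven construction described above can be sketched as a simple loop: train, collect the (topic, text) pairs the model gets wrong on a held-out set, synthesize targeted examples for those failure modes, and retrain. The paper gives only the high-level idea; the function names and round structure below are illustrative assumptions, not the authors' implementation.

```python
def failure_driven_rounds(train, evaluate, synthesize,
                          seed_pairs, eval_pairs, max_rounds=3):
    """Illustrative sketch of failure-driven dataset construction.

    train(data)            -> a trained model (any object)
    evaluate(model, pair)  -> True if the model classifies `pair` correctly
    synthesize(failures)   -> new labeled pairs targeting those failures
    Pairs are (topic, text, label) tuples; all callables are placeholders.
    """
    data = list(seed_pairs)
    model = train(data)
    for _ in range(max_rounds):
        # Find held-out pairs the current model misclassifies.
        failures = [p for p in eval_pairs if not evaluate(model, p)]
        if not failures:
            break
        # Add targeted synthetic data for the observed weaknesses, retrain.
        data += synthesize(failures)
        model = train(data)
    return model, data
```

The key design point mirrored here is that new data is generated *in response to* observed failures rather than sampled uniformly, which is how the authors report closing coverage gaps that no single source could fill.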

Abstract

Determining whether a piece of text is relevant to a given topic is a fundamental task in natural language processing, yet it remains largely unexplored for Bahasa Indonesia. Unlike sentiment analysis or named entity recognition, relevancy classification requires the model to reason about the relationship between two inputs simultaneously: a topical context and a candidate text. We introduce IndoBERT-Relevancy, a context-conditioned relevancy classifier built on IndoBERT Large (335M parameters) and trained on a novel dataset of 31,360 labeled pairs spanning 188 topics. Through an iterative, failure-driven data construction process, we demonstrate that no single data source is sufficient for robust relevancy classification, and that targeted synthetic data can effectively address specific model weaknesses. Our final model achieves an F1 score of 0.948 and an accuracy of 96.5%, handling both formal and informal Indonesian text. The model is publicly available on HuggingFace.
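The "two inputs simultaneously" framing corresponds to the standard BERT cross-encoder convention: topic and candidate text are packed into one sequence so self-attention can relate tokens across the two segments. A minimal sketch of that pairing, assuming the usual `[CLS]`/`[SEP]` layout (the paper does not spell out its exact input template):

```python
from dataclasses import dataclass

@dataclass
class RelevancyExample:
    topic: str   # topical context, e.g. "sepak bola" (football)
    text: str    # candidate text to judge against the topic
    label: int   # assumed convention: 1 = relevant, 0 = not relevant

def to_pair_input(ex: RelevancyExample, cls="[CLS]", sep="[SEP]") -> str:
    # Standard BERT sentence-pair layout: both segments share one sequence,
    # so the classifier is conditioned on topic AND text jointly rather
    # than scoring the candidate text in isolation.
    return f"{cls} {ex.topic} {sep} {ex.text} {sep}"
```

In practice a HuggingFace tokenizer produces this layout directly via `tokenizer(topic, text, ...)`, and the released checkpoint would be loaded with `AutoModelForSequenceClassification`; the label mapping above is an assumption to check against the model's config.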