Addressing Data Scarcity in Bangla Fake News Detection: An LLM-Based Dataset Augmentation Approach

arXiv cs.CL / 5/5/2026

Key Points

The paper tackles misinformation detection challenges in Bangla, where progress is constrained by small and imbalanced datasets.
It proposes an LLM-based dataset augmentation framework that generates synthetic Bangla news using the instruction-tuned Gemma 3 27B IT model, with semantic filtering and subsampling to maintain label consistency and diversity.
The study compares zero-shot vs. few-shot prompting, multiple augmentation rates, and random vs. similarity-based selection. Results are best when only the minority class is augmented, at a high augmentation rate and with random subsampling.
Using this approach, the Fake News F1 score improves from 0.85 to 0.88.
To enable reproducibility, the authors publicly release 4,545 synthetic Bangla fake news samples and the full implementation.
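
The filter-then-subsample step described in the key points can be sketched roughly as follows. This is a minimal illustration, not the authors' released implementation: the summary does not say how semantic filtering is scored, so the centroid-based cosine criterion, the `min_sim` threshold, the meaning given to `rate`, and all function names here are assumptions, and the two-dimensional vectors stand in for real sentence embeddings.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def augment_minority(real_fake_vecs, candidates, rate, min_sim=0.5, seed=0):
    """Filter synthetic candidates, then randomly subsample them.

    real_fake_vecs: embeddings of real minority-class (fake) articles.
    candidates: list of (text, embedding) pairs for synthetic articles.
    rate: synthetic samples to add per real minority sample (hypothetical knob).
    min_sim: hypothetical similarity threshold for the semantic filter.
    """
    center = centroid(real_fake_vecs)
    # Semantic filter (assumed criterion): keep candidates that lie close
    # to the centroid of the real fake-news embeddings.
    kept = [text for text, vec in candidates if cosine(vec, center) >= min_sim]
    # Random subsampling down to the size implied by the augmentation rate.
    target = min(len(kept), round(rate * len(real_fake_vecs)))
    return random.Random(seed).sample(kept, target)


# Toy data: candidate "b" points away from the real fake-news centroid.
real = [[1.0, 0.0], [0.8, 0.2]]
cands = [("a", [0.9, 0.1]), ("b", [0.0, 1.0]), ("c", [1.0, 0.1]), ("d", [0.7, 0.3])]
picked = augment_minority(real, cands, rate=1.0)
```

With this toy input, candidate "b" is rejected by the filter and two of the remaining three texts are drawn at random. A similarity-based selection strategy would instead sort `kept` by score and take the top entries, which is the alternative the study compares against.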

Abstract

The growing spread of misinformation in digital media highlights the need for reliable fake news detection systems, yet progress in under-resourced languages such as Bangla is limited by small and imbalanced datasets. This study investigates whether Large Language Model (LLM)-based augmentation can effectively address this limitation and improve Bangla fake news classification. Existing datasets remain valuable but are highly imbalanced, which limits model performance, and LLM-based augmentation for Bangla has been scarcely explored. To fill this gap, we propose a systematic augmentation framework that generates synthetic Bangla news articles using the instruction-tuned Gemma 3 27B IT model, supported by semantic filtering and controlled subsampling to preserve label consistency and diversity. We compare zero-shot and few-shot prompting, evaluate multiple augmentation rates, and examine random versus similarity-based selection strategies. Our experiments show that augmenting only the minority class with a high augmentation rate and random subsampling yields the strongest gains, raising the Fake News F1 score from 0.85 to 0.88. To support reproducibility and further research in this low-resource domain, we publicly release 4,545 synthetically generated Bangla fake news samples along with our full implementation. These findings demonstrate that well-designed LLM-driven augmentation can significantly improve fake news detection in low-resource settings and provide a practical foundation for advancing multilingual misinformation research.
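
For readers less familiar with the reported metric, the per-class F1 score is the harmonic mean of precision and recall computed with one class treated as positive. The sketch below shows that calculation for a "fake" class. The function name and the toy labels are illustrative assumptions and are not taken from the paper's data or code.

```python
def f1_for_class(y_true, y_pred, positive):
    """F1 score treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Toy labels only: 3 fake and 5 real articles, with one miss and one false alarm.
y_true = ["fake", "fake", "fake", "real", "real", "real", "real", "real"]
y_pred = ["fake", "fake", "real", "fake", "real", "real", "real", "real"]
fake_f1 = f1_for_class(y_true, y_pred, positive="fake")
```

Because the score is computed on the minority class alone, it is sensitive to exactly the errors that class imbalance tends to cause, which is why it is a natural headline metric for this kind of augmentation study.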

