Bangla Key2Text: Text Generation from Keywords for a Low Resource Language

arXiv cs.CL / 4/22/2026


Key Points

  • The paper presents Bangla Key2Text, a large-scale dataset containing 2.6M Bangla keyword–text pairs for keyword-driven text generation in a low-resource setting.
  • The dataset is built from millions of Bangla news articles using a BERT-based keyword extraction pipeline to convert raw articles into supervised training examples.
  • The authors fine-tune two sequence-to-sequence models, mT5 and BanglaT5, to create baselines on this new benchmark.
  • Results indicate that task-specific fine-tuning significantly improves keyword-conditioned generation in Bangla versus zero-shot large language models.
  • The dataset, trained models, and code are released publicly to enable further research on Bangla NLG and keyword-to-text generation.
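The dataset-construction step described above can be sketched end to end: extract keywords from a raw article, then emit a supervised (keywords → text) training pair. The paper uses a BERT-based extractor over Bangla news; the toy frequency-based extractor below is only a stand-in to illustrate the pair format, and all function names here are illustrative, not from the paper's released code.

```python
import re
from collections import Counter

STOPWORDS = frozenset({"the", "a", "of", "is", "for", "in", "and", "to", "each"})

def extract_keywords(text, k=3):
    """Toy frequency-based keyword extractor (stand-in for the paper's
    BERT-based pipeline): rank content words by frequency, keep top-k."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [w for w, _ in counts.most_common(k)]

def make_pair(article):
    """Convert one raw article into a supervised (keywords -> text) example,
    the format a seq2seq model like mT5 or BanglaT5 would be fine-tuned on."""
    keywords = extract_keywords(article)
    return {"input": ", ".join(keywords), "target": article}

article = ("The dataset contains keyword and text pairs. Each keyword list "
           "conditions a model to generate the full text of the article.")
pair = make_pair(article)
print(pair["input"])   # top keywords, e.g. "keyword, text, ..."
```

Repeating this over millions of articles yields the keyword–text corpus; the real pipeline's BERT-based scoring replaces the frequency ranking shown here.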

Abstract

This paper introduces Bangla Key2Text, a large-scale dataset of 2.6 million Bangla keyword–text pairs designed for keyword-driven text generation in a low-resource language. The dataset is constructed using a BERT-based keyword extraction pipeline applied to millions of Bangla news texts, transforming raw articles into structured keyword–text pairs suitable for supervised learning. To establish baseline performance on this new benchmark, we fine-tune two sequence-to-sequence models, mT5 and BanglaT5, and evaluate them using multiple automatic metrics and human judgments. Experimental results show that task-specific fine-tuning substantially improves keyword-conditioned text generation in Bangla compared to zero-shot large language models. The dataset, trained models, and code are publicly released to support future research in Bangla natural language generation and keyword-to-text generation tasks.