Wiki Dumps to Training Corpora: South Slavic Case
arXiv cs.CL, April 29, 2026
Key Points
- The paper proposes a two-phase methodology to convert raw Wikimedia dumps into high-quality text corpora for seven South Slavic languages.
- It first extracts and cleans text from multiple Wikimedia projects (e.g., Wikipedia and related sites), carefully handling wiki markup so that only real articles and usable natural-language text are kept (a minimal extraction sketch follows this list).
- It then filters out suspicious or low-quality articles: an n-gram–based check detects high textual redundancy across articles, and heavily duplicated entries are dropped from the final datasets (see the redundancy-filter sketch below).
- The resulting corpora are intended to support training language models and comparative linguistic research, while the authors argue the approach is largely language-agnostic and generalizable.
- Overall, the work emphasizes reliable, high-information corpus creation that better reflects authentic language use and cultural context.
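To make the first phase concrete, the sketch below streams main-namespace, non-redirect pages out of a Wikimedia XML dump and strips wiki markup with the mwparserfromhell library. It is illustrative rather than the authors' actual pipeline; the dump filename, the namespace-0 filter, and the 200-character length floor are assumptions.

```python
# Minimal sketch of phase one: extract article text from a Wikimedia XML dump
# and strip wiki markup. Not the paper's pipeline; filters are illustrative.
import bz2
import xml.etree.ElementTree as ET

import mwparserfromhell  # widely used wikitext parser


def iter_articles(dump_path):
    """Yield (title, plain_text) for main-namespace, non-redirect pages."""
    with bz2.open(dump_path, "rb") as fh:
        for _, elem in ET.iterparse(fh):
            if elem.tag.split("}")[-1] != "page":
                continue
            ns = elem.find("{*}ns")
            redirect = elem.find("{*}redirect")
            title = elem.findtext("{*}title", default="")
            text = elem.findtext("{*}revision/{*}text", default="")
            # Keep only real articles: namespace 0 and no redirect marker.
            if ns is not None and ns.text == "0" and redirect is None and text:
                plain = mwparserfromhell.parse(text).strip_code().strip()
                if len(plain) > 200:  # assumed minimum length; tune per language
                    yield title, plain
            elem.clear()  # free memory while streaming through the dump


if __name__ == "__main__":
    # Hypothetical Croatian Wikipedia dump filename, for illustration only.
    for title, plain in iter_articles("hrwiki-latest-pages-articles.xml.bz2"):
        print(title, len(plain))
```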
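The second phase can be approximated as follows: compare each article's word n-grams against everything accepted so far and drop entries whose overlap is too high. The n-gram length (5), the 0.5 overlap threshold, and the single-pass "seen set" strategy here are assumptions for illustration, not the paper's exact criteria.

```python
# Rough sketch of phase two: flag articles whose word n-grams largely repeat
# text already seen elsewhere in the collection. Parameters are assumptions.
from typing import Iterable, Iterator, List, Set, Tuple


def ngrams(tokens: List[str], n: int = 5) -> Set[Tuple[str, ...]]:
    """Return the set of word n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def filter_redundant(articles: Iterable[Tuple[str, str]],
                     n: int = 5,
                     max_overlap: float = 0.5) -> Iterator[Tuple[str, str]]:
    """Yield (title, text) pairs whose n-gram overlap with earlier articles
    stays below max_overlap; heavily duplicated entries are dropped."""
    seen: Set[Tuple[str, ...]] = set()
    for title, text in articles:
        grams = ngrams(text.lower().split(), n)
        if not grams:
            continue
        overlap = len(grams & seen) / len(grams)
        if overlap < max_overlap:
            seen |= grams
            yield title, text
        # else: article is mostly repeated text -> excluded from the corpus
```

In practice the two sketches compose: feeding iter_articles(...) into filter_redundant(...) yields a deduplicated stream of (title, text) pairs ready to be written out as a corpus.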