SozKZ: Training Efficient Small Language Models for Kazakh from Scratch
arXiv cs.CL / March 24, 2026
📰 News · Signals & Early Trends · Models & Research
Key Points
- The paper introduces SozKZ, a family of Llama-architecture small language models (50M–600M parameters) trained from scratch on Kazakh, using a dedicated 50K BPE tokenizer optimized for the language’s agglutinative morphology (hedged sketches of the tokenizer and a comparable model config follow this list).
- SozKZ is trained on 9 billion Kazakh tokens and evaluated on three Kazakh benchmarks (cultural multiple-choice QA, reading comprehension, and topic classification), with comparisons against multilingual baselines of up to 3B parameters.
- The 600M model reaches 30.3% accuracy on Kazakh cultural QA, closely approaching Llama-3.2-1B (32.0%) despite its smaller size, and achieves 25.5% on topic classification, outperforming all evaluated multilingual models of up to 2B parameters.
- The authors report consistent scaling behavior from 50M to 600M parameters, with cultural QA accuracy improving from 22.8% to 30.3%, suggesting that scaling beyond 600M may yield further gains.
- All model weights and the tokenizer are released under open licenses, positioning the approach as a computationally efficient path for low-resource language technology.
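The dedicated 50K BPE tokenizer is the most reproducible piece of the recipe. Below is a minimal sketch of how such a tokenizer could be trained with the Hugging Face `tokenizers` library; the corpus file `kk_corpus.txt`, the byte-level pre-tokenizer, and the special tokens are illustrative assumptions, not the authors' exact setup.

```python
# Sketch: training a 50K-vocab BPE tokenizer on Kazakh text.
# The corpus path and special tokens below are assumptions for illustration.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Byte-level BPE keeps every Kazakh Cyrillic character representable
# without needing an <unk> fallback.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)

trainer = trainers.BpeTrainer(
    vocab_size=50_000,                        # the 50K vocabulary from the paper
    special_tokens=["<s>", "</s>", "<pad>"],  # assumed special tokens
)

# "kk_corpus.txt" is a hypothetical plain-text Kazakh corpus file.
tokenizer.train(files=["kk_corpus.txt"], trainer=trainer)
tokenizer.save("sozkz_tokenizer.json")
```

For an agglutinative language like Kazakh, a vocabulary trained this way tends to split a word such as кітаптарымыздан ("from our books": кітап + тар + ымыз + дан) into a few morpheme-like pieces, whereas a general multilingual tokenizer often falls back to many short byte-level fragments.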
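For a sense of scale, the 600M configuration can be pictured with a standard `transformers` Llama config. The depth, width, and head count below are assumptions chosen only so the parameter count lands near 600M with the 50K vocabulary; the paper's actual hyperparameters may differ.

```python
# Sketch: a Llama-architecture config in the ~600M-parameter range.
# All sizes here are guesses for illustration, not the paper's values.
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=50_000,            # matches the dedicated 50K BPE tokenizer
    hidden_size=1280,             # assumed model width
    intermediate_size=3456,       # assumed FFN width
    num_hidden_layers=24,         # assumed depth
    num_attention_heads=20,       # assumed head count (head_dim = 64)
    max_position_embeddings=2048,
)

# Random initialization, as in from-scratch pretraining.
model = LlamaForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")
```

Instantiating the model from the config gives randomly initialized weights, which matches the from-scratch setup the paper describes even though these exact sizes are placeholders.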