AI Navigate

Long-Context Encoder Models for Polish Language Understanding

arXiv cs.CL / March 13, 2026


Key Points

  • The paper introduces a Polish encoder-only model capable of processing sequences of up to 8192 tokens, addressing the short-context limitation of traditional BERT-like encoders.
  • It uses a two-stage training procedure—positional embedding adaptation followed by full parameter continuous pre-training—along with compressed variants via knowledge distillation to balance performance and efficiency.
  • Evaluations across 25 tasks, including KLEJ and FinBench, show the model achieves the best average performance among Polish and multilingual models on long-context tasks while preserving short-text quality.
  • The work, released as arXiv:2603.12191v1, marks meaningful progress toward long-document understanding in Polish and multilingual NLP.
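The first stage of the training procedure adapts the positional embeddings to the longer 8192-token window. The paper summary does not specify the exact adaptation mechanism, but a common warm-start approach for learned absolute position embeddings is to interpolate the existing table to the new length before continued pre-training. A minimal sketch of that idea (the function name and shapes are illustrative, not from the paper):

```python
import numpy as np

def extend_position_embeddings(pe: np.ndarray, new_len: int) -> np.ndarray:
    """Linearly interpolate a learned position-embedding table of shape
    (old_len, dim) to (new_len, dim), so a model trained with a short
    context can be warm-started at a longer one before further training."""
    old_len, _ = pe.shape
    # Map each new position index onto the old positional axis.
    new_pos = np.linspace(0.0, old_len - 1, new_len)
    lo = np.floor(new_pos).astype(int)
    hi = np.minimum(lo + 1, old_len - 1)
    frac = (new_pos - lo)[:, None]
    return (1.0 - frac) * pe[lo] + frac * pe[hi]

# Example: stretch a 512-position table (BERT-style) to 8192 positions.
old_table = np.random.default_rng(0).normal(size=(512, 768))
new_table = extend_position_embeddings(old_table, 8192)
print(new_table.shape)  # (8192, 768)
```

After this adaptation, the second stage (full-parameter continued pre-training on long sequences) lets the model refine the interpolated positions on real data.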

Abstract

While decoder-only Large Language Models (LLMs) have recently dominated the NLP landscape, encoder-only architectures remain a cost-effective and parameter-efficient standard for discriminative tasks. However, classic encoders like BERT are limited by a short context window, which is insufficient for processing long documents. In this paper, we address this limitation for the Polish language by introducing a high-quality Polish model capable of processing sequences of up to 8192 tokens. The model was developed by employing a two-stage training procedure that involves positional embedding adaptation and full-parameter continuous pre-training. Furthermore, we propose compressed model variants trained via knowledge distillation. The models were evaluated on 25 tasks, including the KLEJ benchmark, a newly introduced financial task suite (FinBench), and other classification and regression tasks, specifically those requiring long-document understanding. The results demonstrate that our model achieves the best average performance among Polish and multilingual models, significantly outperforming competitive solutions in long-context tasks while maintaining comparable quality on short texts.
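The compressed variants are trained via knowledge distillation. The abstract does not detail the objective, but the standard Hinton-style recipe matches the student's temperature-softened output distribution to the teacher's; a minimal sketch under that assumption (function names and the temperature value are illustrative):

```python
import numpy as np

def softmax(logits: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits: np.ndarray,
                      teacher_logits: np.ndarray,
                      temperature: float = 2.0) -> float:
    """KL(teacher || student) on temperature-softened distributions.
    The t**2 factor keeps gradient magnitudes comparable across
    temperatures, as in the standard distillation formulation."""
    p = softmax(teacher_logits, temperature)  # soft teacher targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = (p * (np.log(p) - np.log(q))).sum(axis=-1).mean()
    return float(kl * temperature ** 2)

# A batch of 4 examples over 3 classes: identical logits give ~0 loss.
t_logits = np.array([[2.0, 0.5, -1.0]] * 4)
print(round(distillation_loss(t_logits, t_logits), 6))  # 0.0
```

In practice this soft-target term is usually combined with the ordinary hard-label loss on the student; the balance between the two is a tuning choice the summary does not specify.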