HUKUKBERT: Domain-Specific Language Model for Turkish Law

arXiv cs.CL / 4/7/2026


Key Points

  • HukukBERT is introduced as a domain-specific language model for Turkish legal NLP, trained on an 18GB cleaned Turkish legal corpus using hybrid Domain-Adaptive Pre-Training (DAPT).
  • The paper details a targeted pretraining approach combining multiple masking strategies (Whole-Word, Token Span, Word Span, and Keyword masking) plus a 48K WordPiece tokenizer, and compares the results against both general and existing Turkish legal models.
  • On a newly proposed Legal Cloze Test benchmark for Turkish court decisions, HukukBERT reaches 84.40% Top-1 accuracy and sets state-of-the-art performance.
  • For the downstream task of structural segmentation of official Turkish court decisions, the model achieves a 92.8% document pass rate, another state-of-the-art result.
  • The authors release HukukBERT with the goal of enabling future Turkish legal NLP research such as named-entity recognition, judgment prediction, and legal document classification.
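The Whole-Word Masking strategy named above can be sketched in a few lines: with a WordPiece tokenizer, continuation pieces carry a `##` prefix, so masking a whole word means masking every piece belonging to it, not just one subtoken. This is a minimal illustrative sketch, not the paper's training code; the example tokens and the `whole_word_mask` helper are hypothetical.

```python
import random

def whole_word_mask(tokens, mask_prob=0.15, seed=0):
    """Whole-Word Masking sketch: WordPiece pieces starting with '##'
    continue the preceding word, so a selected word has ALL of its
    pieces replaced with [MASK]."""
    # Group subtoken indices into whole words.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)  # continuation piece joins current word
        else:
            words.append([i])    # new word starts here
    rng = random.Random(seed)
    masked = list(tokens)
    for word in words:
        if rng.random() < mask_prob:
            for i in word:
                masked[i] = "[MASK]"
    return masked

# Hypothetical WordPiece output for a Turkish legal phrase.
pieces = ["temyiz", "mahkeme", "##si", "karar", "##ı"]
print(whole_word_mask(pieces, mask_prob=0.3))
```

Token Span and Word Span masking differ mainly in the sampling unit (contiguous subtoken spans vs. contiguous word spans), and Keyword Masking targets a curated list of legal terms instead of sampling uniformly.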

Abstract

Recent advances in natural language processing (NLP) have increasingly enabled LegalTech applications, yet studies specific to Turkish law remain limited due to the scarcity of domain-specific data and models. Although extensive models like LEGAL-BERT have been developed for English legal texts, the Turkish legal domain lacks a high-volume domain-specific counterpart. In this paper, we introduce HukukBERT, the most comprehensive legal language model for Turkish, trained on an 18 GB cleaned legal corpus using a hybrid Domain-Adaptive Pre-Training (DAPT) methodology that integrates Whole-Word Masking, Token Span Masking, Word Span Masking, and targeted Keyword Masking. We systematically compared our 48K WordPiece tokenizer and DAPT approach against general-purpose and existing domain-specific Turkish models. Evaluated on a novel Legal Cloze Test benchmark -- a masked legal term prediction task designed for Turkish court decisions -- HukukBERT achieves state-of-the-art performance with 84.40% Top-1 accuracy, substantially outperforming existing models. Furthermore, we evaluated HukukBERT on the downstream task of structural segmentation of official Turkish court decisions, where it achieves a 92.8% document pass rate, establishing a new state-of-the-art. We release HukukBERT to support future research in Turkish legal NLP, including named-entity recognition, judgment prediction, and legal document classification.
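The reported 84.40% Top-1 accuracy on the Legal Cloze Test amounts to: for each court-decision sentence with one masked legal term, check whether the model's single highest-ranked fill matches the gold term. A minimal scoring sketch, assuming a hypothetical `predict` callable that returns candidate fills ranked by model score (not the paper's actual evaluation harness):

```python
def cloze_top1_accuracy(items, predict):
    """Top-1 accuracy on a cloze benchmark.

    items:   list of (masked_text, gold_term) pairs
    predict: callable returning candidate fills, best first (assumed API)
    """
    hits = sum(1 for text, gold in items if predict(text)[0] == gold)
    return hits / len(items)

# Toy illustration with made-up Turkish cloze items and a stub model
# that always ranks "karar" first.
items = [
    ("Mahkeme [MASK] verdi.", "karar"),
    ("Sanığın [MASK] talebi reddedildi.", "tahliye"),
]
stub_model = lambda text: ["karar", "hüküm", "tahliye"]
print(cloze_top1_accuracy(items, stub_model))  # 0.5: first item hit, second miss
```

In practice the ranked candidates would come from a masked-LM head (e.g. a fill-mask pipeline over HukukBERT); the stub here only demonstrates the metric.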