Swiss-Bench SBP-002: A Frontier Model Comparison on Swiss Legal and Regulatory Tasks

Key Points

Swiss-Bench SBP-002 is introduced as a trilingual benchmark (German, French, Italian) covering 395 expert-crafted items across FINMA, Legal-CH, and EFK Swiss regulatory domains, with seven task types.

Abstract

While recent work has benchmarked large language models on Swiss legal translation (Niklaus et al., 2025) and academic legal reasoning from university exams (Fan et al., 2025), no existing benchmark evaluates frontier model performance on applied Swiss regulatory compliance tasks. I introduce Swiss-Bench SBP-002, a trilingual benchmark of 395 expert-crafted items spanning three Swiss regulatory domains (FINMA, Legal-CH, EFK), seven task types, and three languages (German, French, Italian), and evaluate ten frontier models from March 2026 using a structured three-dimension scoring framework assessed via a blind three-judge LLM panel (GPT-4o, Claude Sonnet 4, Qwen3-235B) with majority-vote aggregation and weighted kappa = 0.605, with reference answers validated by an independent human legal expert on a 100-item subset (73% rated Correct, 0% Incorrect, perfect Legal Accuracy). Results reveal three descriptive performance clusters: Tier A (35-38% correct), Tier B (26-29%), and Tier C (13-21%). The benchmark proves difficult: even the top-ranked model (Qwen 3.5 Plus) achieves only 38.2% correct, with 47.3% incorrect and 14.4% partially correct. Task type difficulty varies widely: legal translation and case analysis yield 69-72% correct rates, while regulatory Q&A, hallucination detection, and gap analysis remain below 9%. Within this roster (seven open-weight, three closed-source), an open-weight model leads the ranking, and several open-weight models match or outperform their closed-source counterparts. These findings provide an initial empirical reference point for assessing frontier model capability on Swiss regulatory tasks under zero-retrieval conditions.

Swiss-Bench SBP-002: A Frontier Model Comparison on Swiss Legal and Regulatory Tasks

Key Points

Abstract

Related Articles

Speaking of VoxtralResearchVoxtral TTS: A frontier, open-weights text-to-speech model that’s fast, instantly adaptable, and produces lifelike speech for voice agents.

How to Use MiMo V2 API for Free in 2026: Complete Guide

王明阳

mistralai/Voxtral-4B-TTS-2603 · Hugging Face

ByteDance’s new AI video generation model, Dreamina Seedance 2.0, comes to CapCut

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer

Related Articles

Speaking of VoxtralResearchVoxtral TTS: A frontier, open-weights text-to-speech model that’s fast, instantly adaptable, and produces lifelike speech for voice agents.
Mistral AI Blog

How to Use MiMo V2 API for Free in 2026: Complete Guide
Dev.to

mistralai/Voxtral-4B-TTS-2603 · Hugging Face
Reddit r/LocalLLaMA

ByteDance’s new AI video generation model, Dreamina Seedance 2.0, comes to CapCut
TechCrunch