BenGER: A Collaborative Web Platform for End-to-End Benchmarking of German Legal Tasks

arXiv cs.CL / 4/16/2026


Key Points

  • BenGER is introduced as an open-source, collaborative web platform that supports end-to-end benchmarking of LLMs for German legal reasoning, from task design to metric-based evaluation.
  • The framework integrates workflows for expert annotation, configurable LLM execution, and multiple evaluation approaches including lexical, semantic, factual, and judge-based metrics.
  • BenGER is designed to improve transparency and reproducibility by keeping the benchmarking pipeline in one system rather than splitting it across separate scripts and platforms.
  • It enables multi-organization projects with tenant isolation and role-based access control, and it can optionally deliver formative, reference-grounded feedback to annotators.
  • The authors plan a live deployment demonstration, covering benchmark creation through analysis, to show the platform in practical collaborative use.

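To make the metric-based evaluation step concrete, here is a minimal, hypothetical sketch of how a pipeline like the one described might aggregate several metrics per prediction. The function names and metric choices (exact match plus token-overlap F1 as a lexical metric) are illustrative assumptions, not BenGER's actual API; semantic, factual, and judge-based metrics would slot into the same aggregation pattern.

```python
# Hypothetical sketch of multi-metric evaluation; names are illustrative,
# not BenGER's actual API.

def token_f1(prediction: str, reference: str) -> float:
    """Lexical metric: token-overlap F1 between prediction and reference."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    if not pred or not ref:
        return 0.0
    # Count overlapping tokens, respecting multiplicity in the reference.
    ref_counts: dict[str, int] = {}
    for tok in ref:
        ref_counts[tok] = ref_counts.get(tok, 0) + 1
    common = 0
    for tok in pred:
        if ref_counts.get(tok, 0) > 0:
            common += 1
            ref_counts[tok] -= 1
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)

def evaluate(prediction: str, reference: str) -> dict:
    """Aggregate several metrics into one record, as a pipeline might store it.
    Semantic, factual, and judge-based scores would be added the same way."""
    return {
        "exact_match": float(prediction.strip() == reference.strip()),
        "token_f1": round(token_f1(prediction, reference), 3),
    }
```

A run over an annotated dataset would then apply `evaluate` to each (model output, expert reference) pair and store the per-item records for later analysis.
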
Abstract

Evaluating large language models (LLMs) for legal reasoning requires workflows that span task design, expert annotation, model execution, and metric-based evaluation. In practice, these steps are split across platforms and scripts, limiting transparency, reproducibility, and participation by non-technical legal experts. We present the BenGER (Benchmark for German Law) framework, an open-source web platform that integrates task creation, collaborative annotation, configurable LLM runs, and evaluation with lexical, semantic, factual, and judge-based metrics. BenGER supports multi-organization projects with tenant isolation and role-based access control, and can optionally provide formative, reference-grounded feedback to annotators. We will demonstrate a live deployment showing end-to-end benchmark creation and analysis.