Swiss Parliaments Corpus Re-Imagined (SPC_R): Enhanced Transcription with RAG-based Correction and Predicted BLEU
arXiv cs.CL / 3/13/2026
Key Points
- The paper announces a new long-form release of the Swiss Parliaments Corpus, converting multi-hour Swiss German debates into high-quality speech-text pairs aligned with official protocols.
- The pipeline transcribes the audio into Standard German with Whisper Large-v3 under high-compute settings, then applies a two-step GPT-4o correction: one pass refines misrecognitions (notably named entities) and a second assesses semantic completeness.
- Segments are filtered using a Predicted BLEU score and a GPT-4o evaluation. The corpus contains 801 hours of audio, of which 555 hours pass quality control.
- Compared to the original sentence-level release, SPC_R achieves a 6-point BLEU improvement, demonstrating the effectiveness of combining robust ASR, LLM-based correction, and data-driven filtering for low-resource, domain-specific corpora.
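The filtering step described above can be illustrated with a minimal sketch. The summary does not say how the two scores are combined or which thresholds are used, so everything here is an assumption: the `Segment` fields, the score ranges, the default thresholds, and the rule that a segment is kept only when both scores meet their threshold are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One transcribed audio segment (hypothetical structure)."""
    text: str
    duration_s: float      # segment length in seconds
    predicted_bleu: float  # assumed to lie in 0..100
    llm_score: float       # assumed judge score, e.g. 1..5


def passes_qc(seg, min_bleu=70.0, min_llm_score=3.0):
    # Assumed rule: keep a segment only if BOTH scores reach their threshold.
    # The thresholds are placeholders, not values from the paper.
    return seg.predicted_bleu >= min_bleu and seg.llm_score >= min_llm_score


def filter_segments(segments, min_bleu=70.0, min_llm_score=3.0):
    """Return the segments that pass quality control, in input order."""
    return [s for s in segments if passes_qc(s, min_bleu, min_llm_score)]


def total_hours(segments):
    """Sum segment durations and convert seconds to hours."""
    return sum(s.duration_s for s in segments) / 3600.0
```

Under these assumptions, comparing `total_hours(segments)` with `total_hours(filter_segments(segments))` gives the kind of "total hours versus hours passing quality control" figures reported in the key points.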