Swiss Parliaments Corpus Re-Imagined (SPC_R): Enhanced Transcription with RAG-based Correction and Predicted BLEU
arXiv cs.CL / 3/13/2026
Key Points
- The paper announces a new long-form release of the Swiss Parliaments Corpus, converting multi-hour Swiss German debates into high-quality speech-text pairs aligned with official protocols.
- The pipeline transcribes audio with Whisper Large-v3 to Standard German under high compute settings, then applies a two-step GPT-4o correction to refine misrecognitions (notably named entities) and assess semantic completeness.
- Segments are filtered using a Predicted BLEU score and a GPT-4o evaluation. The corpus totals 801 hours of audio, of which 555 hours pass quality control.
- Compared to the original sentence-level release, SPC_R achieves a 6-point BLEU improvement, demonstrating the effectiveness of combining robust ASR, LLM-based correction, and data-driven filtering for low-resource, domain-specific corpora.
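The filtering step in the key points can be illustrated with a minimal sketch. This is not the paper's implementation: the `Segment` fields, the threshold of 70.0, and the rule that a segment must pass both the Predicted BLEU check and the LLM evaluation are all assumptions made for illustration, since the summary does not say how the two signals are combined or which cutoff is used.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    segment_id: str
    duration_s: float
    predicted_bleu: float  # hypothetical field: estimated BLEU on a 0-100 scale
    llm_accepts: bool      # hypothetical field: verdict of the LLM evaluation


def passes_quality_control(seg: Segment, min_predicted_bleu: float = 70.0) -> bool:
    # Assumption: both signals must agree; the threshold is illustrative only.
    return seg.llm_accepts and seg.predicted_bleu >= min_predicted_bleu


def summarize(segments: list[Segment], min_predicted_bleu: float = 70.0):
    """Return the kept segments plus total and kept audio duration in hours."""
    kept = [s for s in segments if passes_quality_control(s, min_predicted_bleu)]
    total_h = sum(s.duration_s for s in segments) / 3600
    kept_h = sum(s.duration_s for s in kept) / 3600
    return kept, total_h, kept_h


# Toy data, not taken from the corpus.
segments = [
    Segment("a", 1800.0, 82.5, True),
    Segment("b", 3600.0, 45.0, True),   # rejected: predicted BLEU too low
    Segment("c", 1800.0, 91.0, False),  # rejected: LLM evaluation fails
    Segment("d", 3600.0, 75.0, True),
]
kept, total_h, kept_h = summarize(segments)
print([s.segment_id for s in kept], total_h, kept_h)
```

Reporting total and kept hours side by side mirrors how the summary states the corpus size (all audio versus audio that passes quality control).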