CLaC at SemEval-2026 Task 6: Response Clarity Detection in Political Discourse

arXiv cs.CL / 5/5/2026

Key Points

  • The paper presents a system for SemEval-2026 Task 6 (CLARITY) focused on detecting response clarity and evasion in question–answer pairs from U.S. presidential interviews.
  • Results show an LLM ensemble reaching a macro-F1 of 80 on the 3-class Task 1 (ranking 9th of 41) and 59 on the 9-class Task 2 (3rd of 33), strong results across both label granularities.
  • For transformer encoders, a four-stage training pipeline with partial encoder layer unfreezing outperforms full fine-tuning by a wide margin (see the sketch after this list), and ensembling English and multilingual encoders improves on either family alone, even though the multilingual models are individually weaker.
  • Surprisingly, prompt-based LLMs without task-specific parameter updates outperform fine-tuned encoders, especially on minority classes, and for open-weight LLMs parameter count alone does not predict effectiveness.
  • Enriching the input by concatenating the full interviewer turn improves LLM performance but not encoder performance; the dominant remaining failure mode is the Clear Reply/Ambivalent boundary, mirroring disagreement among human annotators.

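The four-stage pipeline is only summarized here, so the following is a minimal sketch of partial encoder layer unfreezing with a Hugging Face sequence classifier, assuming a RoBERTa-style encoder in which only the top few encoder layers and the classification head are trained; the model name, layer budget, and label count are illustrative, not the paper's exact configuration.

```python
from transformers import AutoModelForSequenceClassification

# Illustrative choices; the paper's exact encoder, layer budget, and schedule are not given here.
MODEL_NAME = "roberta-base"   # assumption: one English encoder from the 8-model pool
NUM_LABELS = 3                # Task 1 distinguishes 3 clarity classes
TOP_LAYERS_TO_TRAIN = 4       # assumption: unfreeze only the top 4 encoder layers

model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=NUM_LABELS)

# Freeze every parameter, then re-enable the top encoder layers and the classifier head.
for param in model.parameters():
    param.requires_grad = False

for layer in model.roberta.encoder.layer[-TOP_LAYERS_TO_TRAIN:]:
    for param in layer.parameters():
        param.requires_grad = True

for param in model.classifier.parameters():
    param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
```

One plausible reading of a staged pipeline is to repeat this setup with progressively more layers unfrozen (typically at a lower learning rate each stage), but that schedule is an assumption rather than something stated in the abstract.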
Abstract

In this paper, we present our system for SemEval-2026 Task 6 (CLARITY) on response clarity and evasion detection in question-answer pairs from U.S. presidential interviews, comparing fine-tuned encoders with prompt-based LLMs. Our LLM ensemble achieves 80 macro-F1 on the 3-class Task 1 (9th/41) and 59 on the 9-class Task 2 (3rd/33). Across 8 transformer encoders optimized through a four-stage pipeline, partial encoder layer unfreezing outperforms full fine-tuning by a wide margin. Combining English and multilingual encoders further improves ensemble performance over either family alone, despite multilingual models being individually weaker. Prompt-based LLMs, without any task-specific parameter updates, outperform fine-tuned encoders, particularly on minority classes; among open-weight LLMs, parameter count does not predict performance. Enriched input, concatenating the full interviewer turn, improves LLM performance but not that of encoders, an effect that persists with Longformer's extended context window, suggesting the divergence is not attributable to sequence-length capacity alone in our settings. The Clear Reply/Ambivalent boundary remains the dominant failure mode, mirroring the disagreement among human annotators. Our code, prompts, model configurations, and results are publicly available.
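The "enriched input" concatenates the full interviewer turn with the question–answer pair. A minimal sketch of how such an input and a zero-shot classification prompt could be assembled for a prompt-based LLM is shown below; the field labels, prompt wording, and the third class name are illustrative assumptions (only "Clear Reply" and "Ambivalent" are named in the text), not the paper's actual prompt.

```python
# Illustrative formatting; the paper's exact separators and prompt wording are not given here.
def build_enriched_input(interviewer_turn: str, question: str, answer: str) -> str:
    """Prepend the full interviewer turn to the question-answer pair."""
    return (
        f"Interviewer turn: {interviewer_turn}\n"
        f"Question: {question}\n"
        f"Answer: {answer}"
    )

# "Clear Reply" and "Ambivalent" appear in the abstract; "Evasive" is a placeholder for the third class.
TASK1_LABELS = ["Clear Reply", "Ambivalent", "Evasive"]

def build_prompt(enriched_input: str) -> str:
    """Zero-shot prompt asking an instruction-tuned LLM to choose one clarity label."""
    return (
        "Classify how clearly the answer responds to the interviewer's question.\n"
        f"Answer with exactly one of: {', '.join(TASK1_LABELS)}.\n\n"
        f"{enriched_input}\n\nLabel:"
    )
```

For the encoders, the same enriched string would presumably be tokenized and truncated to the model's maximum length, which is the capacity concern the abstract rules out by testing Longformer's extended context window.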
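The combination rule behind the LLM ensemble is not described in this summary, so the sketch below assumes plain majority voting over per-model label predictions, with ties broken in favor of the earliest-listed model; both choices are assumptions, not the paper's method.

```python
from collections import Counter

def majority_vote(predictions: list[list[str]]) -> list[str]:
    """Combine label predictions from several models by majority vote.

    predictions[m][i] is model m's label for example i. Ties go to the
    earliest-listed model (an assumption, not the paper's rule).
    """
    ensembled = []
    for labels in zip(*predictions):  # one example's labels across all models
        counts = Counter(labels)
        best = max(counts.values())
        winner = next(label for label in labels if counts[label] == best)
        ensembled.append(winner)
    return ensembled

# Example: three models voting on two question-answer pairs.
preds = [
    ["Clear Reply", "Ambivalent"],
    ["Clear Reply", "Clear Reply"],
    ["Ambivalent",  "Ambivalent"],
]
print(majority_vote(preds))  # ['Clear Reply', 'Ambivalent']
```

The same voting scheme could also be applied to the encoder ensembles, where the abstract reports that mixing English and multilingual models beats either family alone.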