QU-NLP at ArchEHR-QA 2026: Two-Stage QLoRA Fine-Tuning of Qwen3-4B for Patient-Oriented Clinical Question Answering and Evidence Sentence Alignment

arXiv cs.CL / April 17, 2026


Key Points

  • The QU-NLP team presents a unified system for ArchEHR-QA 2026 that tackles both answer generation and evidence sentence alignment, pairing a fine-tuned generative model with a retrieval-based evidence aligner rather than a single end-to-end model.
  • For answer generation (Subtask 3), they fine-tune Qwen3-4B with a two-stage quantized LoRA (QLoRA) pipeline: first on 30,000 emrQA-MedSQuAD samples for clinical-domain adaptation, then on 20 annotated development cases for task-specific output style.
  • The resulting system scores 32.87 overall on the official test-2026 split for Subtask 3, with reported metrics including BLEU 9.42, ROUGE-L 27.04, and BERTScore 43.00.
  • For evidence alignment (Subtask 4), they combine three retrieval approaches (BM25 with relative thresholding, TF-IDF cosine similarity, and a fine-tuned cross-encoder) into a weighted ensemble, reaching micro-F1 67.16 on a 100-case test set.
  • Their experiments suggest the core limitation is that 20 annotated training cases are not enough to reliably separate relevant from irrelevant clinical sentences, making data augmentation the most promising next step.
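The evidence-alignment ensemble described above can be sketched in a few lines: per-sentence scores from each retriever are normalised, combined with fixed weights, and sentences are kept only if they clear a relative threshold (a fraction of the best combined score). The weights, min-max normalisation, and threshold value here are illustrative assumptions, not the authors' tuned settings, and the per-method scores are stand-ins for real BM25, TF-IDF, and cross-encoder outputs.

```python
def min_max(scores):
    """Normalise a score list to [0, 1] so different retrievers are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def ensemble_select(bm25, tfidf, cross, weights=(0.3, 0.2, 0.5),
                    rel_threshold=0.5):
    """Return indices of note sentences selected as supporting evidence.

    Each method's scores are min-max normalised, combined with `weights`,
    and a sentence is kept when its combined score reaches `rel_threshold`
    times the best combined score (relative thresholding, applied here to
    the ensemble score for simplicity).
    """
    norm = [min_max(s) for s in (bm25, tfidf, cross)]
    combined = [
        sum(w * norm[m][i] for m, w in enumerate(weights))
        for i in range(len(bm25))
    ]
    cutoff = rel_threshold * max(combined)
    return [i for i, s in enumerate(combined) if s >= cutoff]

# Toy example: four note sentences scored by each retriever.
bm25_scores  = [2.1, 0.4, 1.8, 0.1]
tfidf_scores = [0.6, 0.1, 0.5, 0.0]
cross_scores = [0.9, 0.2, 0.7, 0.1]
print(ensemble_select(bm25_scores, tfidf_scores, cross_scores))  # → [0, 2]
```

Relative (rather than absolute) thresholding keeps the selection rule stable across cases whose raw score scales differ, which matters when only 20 annotated cases are available for tuning.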

Abstract

We present a unified system addressing both Subtask 3 (answer generation) and Subtask 4 (evidence sentence alignment) of the ArchEHR-QA Shared Task. For Subtask 3, we apply two-stage Quantised Low-Rank Adaptation (QLoRA) to Qwen3-4B loaded in 4-bit NF4 quantisation: first on 30,000 samples from the emrQA-MedSQuAD corpus to establish clinical domain competence, then on the 20 annotated development cases to learn the task-specific output style. Our system achieves an overall score of 32.87 on the official test-2026 split (BLEU = 9.42, ROUGE-L = 27.04, SARI = 55.42, BERTScore = 43.00, AlignScore = 25.28, MEDCON = 37.04). For Subtask 4, we develop a weighted ensemble of three retrieval methods - BM25 with relative thresholding, TF-IDF cosine similarity, and a fine-tuned cross-encoder - to identify note sentences supporting a given gold answer, achieving a micro-F1 of 67.16 on the 100-case test set. Experiments reveal that both subtasks expose the same fundamental challenge: 20 annotated training cases are insufficient to distinguish relevant from irrelevant clinical sentences, pointing to data augmentation as the highest-leverage future direction.
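The two-stage QLoRA recipe rests on the LoRA computation: the base weight matrix W stays frozen (and, in QLoRA, 4-bit NF4-quantised), while only a low-rank update B @ A is trained. A minimal pure-Python sketch of that forward pass follows; the toy shapes, rank r, and scaling alpha are hypothetical choices for illustration, not the paper's configuration.

```python
def matmul(X, Y):
    """Plain-Python matrix product of X (m x k) and Y (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_forward(x, W, A, B, alpha=16, r=2):
    """Compute x @ (W + (alpha / r) * B @ A) without modifying W.

    Only A and B would receive gradients during fine-tuning; W is frozen
    (and quantised in QLoRA), which is what makes adapting a 4B-parameter
    model like Qwen3-4B feasible on modest hardware.
    """
    scale = alpha / r
    delta = [[scale * v for v in row] for row in matmul(B, A)]
    W_eff = [[w + d for w, d in zip(wr, dr)] for wr, dr in zip(W, delta)]
    return matmul(x, W_eff)

# With A initialised to zeros (the standard LoRA init), the adapted model
# reproduces the frozen base model exactly.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0, 0.0], [0.0, 1.0]]
A_zero = [[0.0, 0.0], [0.0, 0.0]]
print(lora_forward([[2.0, 3.0]], W, A_zero, B))  # → [[2.0, 3.0]]
```

Two-stage training in this setup simply means running the same adapter-only optimisation twice: first over the 30,000 emrQA-MedSQuAD samples, then continuing from those adapter weights on the 20 development cases.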