MedSkillAudit: A Domain-Specific Audit Framework for Medical Research Agent Skills

arXiv cs.AI / April 23, 2026


Key Points

  • The paper introduces MedSkillAudit, a domain-specific audit framework that assesses medical research AI agent skills on safeguards beyond general-purpose evaluation: scientific integrity, methodological validity, reproducibility, and boundary safety.
  • Evaluated on 75 medical research agent skills across five categories (15 per category), the framework assigns each skill a quality score (0–100), an ordinal release disposition (Production Ready / Limited Release / Beta Only / Reject), and a high-risk failure flag; two experts independently assigned the same judgments as a reliability benchmark (a sketch of this record format follows this list).
  • The mean consensus quality score was 72.4 (SD = 13.0), yet 57.3% of skills fell below the “Limited Release” threshold, indicating that most audited skills do not yet meet release-readiness standards for medical research.
  • Reliability results show system–expert agreement of ICC(2,1) = 0.449 (95% CI: 0.250–0.610), exceeding the human inter-rater ICC benchmark of 0.300 and suggesting the framework aligns with expert consensus better than individual human raters align with each other.
  • Agreement varies by category: Protocol Design shows the strongest alignment (ICC = 0.551), while Academic Writing shows a negative ICC (-0.567), pointing to a structural rubric–expert mismatch that may require rubric redesign.
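
The paper does not publish MedSkillAudit's internal record format or its score-to-disposition cutoffs, so the sketch below only illustrates what a per-skill audit record might look like; the `disposition_from_score` helper and all threshold values are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum

class Disposition(Enum):
    PRODUCTION_READY = "Production Ready"
    LIMITED_RELEASE = "Limited Release"
    BETA_ONLY = "Beta Only"
    REJECT = "Reject"

@dataclass
class AuditRecord:
    skill_id: str             # hypothetical identifier, e.g. "protocol-builder@0.3"
    category: str             # one of the five categories, e.g. "Protocol Design"
    quality_score: float      # 0-100 quality score
    disposition: Disposition  # ordinal release disposition
    high_risk_failure: bool   # flag for high-risk failure modes

def disposition_from_score(score: float, high_risk: bool) -> Disposition:
    """Map a quality score to a release disposition.

    Hypothetical cutoffs -- the paper does not disclose its thresholds.
    """
    if high_risk:                 # high-risk failures short-circuit to Reject
        return Disposition.REJECT
    if score >= 85:
        return Disposition.PRODUCTION_READY
    if score >= 75:
        return Disposition.LIMITED_RELEASE
    if score >= 60:
        return Disposition.BETA_ONLY
    return Disposition.REJECT
```

Having the high-risk flag short-circuit to Reject mirrors the framework's emphasis on boundary safety, though the actual decision logic may combine layered checks rather than a single score cutoff.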

Abstract

Background: Agent skills are increasingly deployed as modular, reusable capability units in AI agent systems. Medical research agent skills require safeguards beyond general-purpose evaluation, including scientific integrity, methodological validity, reproducibility, and boundary safety. This study developed and preliminarily evaluated a domain-specific audit framework for medical research agent skills, with a focus on reliability against expert review.

Methods: We developed MedSkillAudit (skill-auditor@1.0), a layered framework assessing skill release readiness before deployment. We evaluated 75 skills across five medical research categories (15 per category). Two experts independently assigned a quality score (0–100), an ordinal release disposition (Production Ready / Limited Release / Beta Only / Reject), and a high-risk failure flag. System–expert agreement was quantified using ICC(2,1) and linearly weighted Cohen's kappa, benchmarked against the human inter-rater baseline.

Results: The mean consensus quality score was 72.4 (SD = 13.0); 57.3% of skills fell below the Limited Release threshold. MedSkillAudit achieved ICC(2,1) = 0.449 (95% CI: 0.250–0.610), exceeding the human inter-rater ICC of 0.300. System–consensus score divergence (SD = 9.5) was smaller than inter-expert divergence (SD = 12.4), with no directional bias (Wilcoxon p = 0.613). Protocol Design showed the strongest category-level agreement (ICC = 0.551); Academic Writing showed a negative ICC (-0.567), reflecting a structural rubric–expert mismatch.

Conclusions: Domain-specific pre-deployment audit may provide a practical foundation for governing medical research agent skills, complementing general-purpose quality checks with structured audit workflows tailored to scientific use cases.
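
The study's rating data are not released, but the agreement statistics it reports are standard and straightforward to reproduce. The sketch below computes ICC(2,1) (the Shrout–Fleiss two-way random-effects, absolute-agreement, single-rater form) from first principles, plus linearly weighted Cohen's kappa and a paired Wilcoxon signed-rank test via scikit-learn and SciPy. The score matrix is simulated to roughly match the reported summary statistics and is not the paper's data.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score
from scipy.stats import wilcoxon

def icc_2_1(x: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    x has shape (n_subjects, n_raters); here rows are skills and the two
    columns are the audit system and the expert consensus.
    """
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)
    col_means = x.mean(axis=0)
    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)   # between-subjects MS
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)   # between-raters MS
    resid = x - row_means[:, None] - col_means[None, :] + grand
    mse = np.sum(resid ** 2) / ((n - 1) * (k - 1))         # residual MS
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Simulated placeholder scores: 75 skills, consensus mean ~72.4 (SD ~13.0),
# system-consensus divergence SD ~9.5, as reported in the abstract.
rng = np.random.default_rng(0)
expert = rng.normal(72.4, 13.0, size=75)
system = expert + rng.normal(0.0, 9.5, size=75)
scores = np.column_stack([system, expert])
print("ICC(2,1):", round(icc_2_1(scores), 3))

# Linearly weighted Cohen's kappa on the ordinal release dispositions
# (0 = Reject, 1 = Beta Only, 2 = Limited Release, 3 = Production Ready).
sys_disp = rng.integers(0, 4, size=75)
exp_disp = np.clip(sys_disp + rng.integers(-1, 2, size=75), 0, 3)
print("weighted kappa:", round(cohen_kappa_score(sys_disp, exp_disp, weights="linear"), 3))

# Directional-bias check: paired Wilcoxon signed-rank test on the score pairs.
print("Wilcoxon p:", round(wilcoxon(system, expert).pvalue, 3))
```

On real data, `scores` would hold the system score and the expert-consensus score for each skill, and the category-level ICCs (e.g. Protocol Design vs. Academic Writing) would come from running the same function on each 15-skill subset.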