Evaluating Patient Safety Risks in Generative AI: Development and Validation of a FMECA Framework for Generated Clinical Content

arXiv cs.AI / 5/7/2026

💬 Opinion · Models & Research

Key Points

  • The paper addresses a gap in structured patient-safety risk assessment methods for LLM-generated clinical text, proposing an FMECA-based approach tailored to generated summaries.
  • An interdisciplinary panel developed a taxonomy of 14 failure modes and adapted the standard FMECA dimensions (occurrence, severity, detectability) into 5-point ordinal scales for scoring risk (a minimal scoring sketch follows this list).
  • The framework was validated by having reviewers annotate 36 generated discharge summaries (from four patients) produced by an open LLM (GPT-OSS 120B) using real clinical data from Geneva University Hospitals.
  • Results show improved inter-rater reliability across annotation rounds, with moderate-to-substantial agreement for failure mode identification and good agreement for severity and detectability scoring.
  • Usability was rated good on an adapted System Usability Scale (mean SUS score: 79.2/100), and structured feedback supported content validity and high evaluator confidence.
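
In classical FMECA, the three ordinal dimensions are typically combined into a Risk Priority Number (RPN = occurrence × severity × detectability) used to rank failure modes by criticality. This summary does not state how the paper aggregates its 5-point scales, so the following Python sketch assumes the conventional product; the `FailureModeRating` class and the two example failure modes are hypothetical illustrations, not items from the paper's taxonomy.

```python
from dataclasses import dataclass

@dataclass
class FailureModeRating:
    """One reviewer's rating of one failure mode (hypothetical schema)."""
    failure_mode: str   # one of the framework's 14 failure modes
    occurrence: int     # 1 (rare) .. 5 (frequent)
    severity: int       # 1 (negligible harm) .. 5 (catastrophic harm)
    detectability: int  # 1 (easily caught) .. 5 (likely to slip through)

    def __post_init__(self) -> None:
        for dim in ("occurrence", "severity", "detectability"):
            if not 1 <= getattr(self, dim) <= 5:
                raise ValueError(f"{dim} must be on the 1-5 ordinal scale")

    @property
    def rpn(self) -> int:
        # Conventional FMECA Risk Priority Number, here ranging 1..125.
        return self.occurrence * self.severity * self.detectability

# Rank example annotations by criticality, most critical first.
ratings = [
    FailureModeRating("fabricated medication", 2, 5, 4),
    FailureModeRating("omitted follow-up instruction", 3, 3, 3),
]
for r in sorted(ratings, key=lambda r: r.rpn, reverse=True):
    print(f"{r.failure_mode}: RPN = {r.rpn}")
```

On 5-point scales the RPN spans 1 to 125, a coarse but practical ordering for triaging which failure modes to mitigate first.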

Abstract

Objectives: Large language models (LLMs) are increasingly used for clinical text summarization, yet structured methods to assess the associated patient safety risks remain limited. Failure Mode, Effects, and Criticality Analysis (FMECA) provides a proactive framework for systematic risk identification but has not been adapted to LLM-generated clinical content. This study aimed to develop and validate a novel FMECA framework for the prospective assessment of patient safety risks in LLM-generated clinical summaries.

Materials and Methods: An interdisciplinary expert panel (n = 8) developed a taxonomy of failure modes through literature review and brainstorming. The standard FMECA dimensions (occurrence, severity, detectability) were adapted into 5-point ordinal scales. The framework was applied to 36 discharge summaries from four patients, generated by an open LLM (GPT-OSS 120B) using real-world clinical data from Geneva University Hospitals. Reviewers independently annotated the summaries across two rounds. Inter-rater reliability was assessed at the failure-mode, severity, and detectability score levels. Usability and content validity were evaluated using an adapted System Usability Scale (SUS) and structured feedback.

Results: The final framework comprised 14 failure modes organized into categories. Inter-rater agreement improved between rounds, reaching moderate-to-substantial agreement for failure mode identification and good agreement for severity and detectability scoring. Usability was rated as good (mean SUS: 79.2/100), with high evaluator confidence.

Discussion and Conclusion: This study presents the first FMECA-based framework for systematic patient safety risk assessment of LLM-generated clinical summaries. The framework provides a structured, reproducible method for identifying clinically relevant risks introduced by these summaries.
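
The abstract does not name the agreement statistic. A common choice for this kind of annotation study is Cohen's kappa: unweighted for the binary question of whether a reviewer flagged a given failure mode, and quadratically weighted for the 5-point ordinal severity and detectability scores, so that near-misses count less against agreement than large disagreements. The sketch below illustrates both variants on invented reviewer data; it is an assumption about method, not the paper's reported analysis.

```python
from sklearn.metrics import cohen_kappa_score

# Toy annotations from two reviewers over the same eight summaries (invented data).
# Binary: did the reviewer flag a given failure mode in each summary? (1 = yes)
rev_a_flags = [1, 0, 1, 1, 0, 0, 1, 0]
rev_b_flags = [1, 0, 0, 1, 0, 1, 1, 0]

# Ordinal 1-5 severity scores for the cases both reviewers flagged.
rev_a_severity = [4, 2, 5, 3, 1, 4]
rev_b_severity = [4, 3, 5, 2, 1, 4]

# Unweighted kappa suits the binary identification task.
print("failure-mode kappa:", cohen_kappa_score(rev_a_flags, rev_b_flags))

# Quadratic weights make near-misses (4 vs 5) cost less than large gaps (1 vs 5).
print("severity kappa:", cohen_kappa_score(rev_a_severity, rev_b_severity,
                                           weights="quadratic"))
```

With an eight-member panel, studies often report averaged pairwise kappas or a genuinely multi-rater coefficient such as Fleiss' kappa or Krippendorff's alpha; which statistic the paper used is not stated in this summary.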