Cross-Model Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Across Three Large Language Models

arXiv cs.CL / 4/22/2026


Key Points

  • The paper evaluates cross-model consistency for AI-generated exercise prescriptions by repeatedly generating outputs (20 times each) for six clinical scenarios using GPT-4.1, Claude Sonnet 4.6, and Gemini 2.5 Flash under temperature=0.
  • GPT-4.1 achieved the highest mean semantic similarity (0.955), followed by Gemini 2.5 Flash (0.950) and Claude Sonnet 4.6 (0.903); the differences across models were statistically significant.
  • Despite comparable similarity scores, the models behaved in fundamentally different ways: GPT-4.1 produced 100% unique outputs, while only 27.5% of Gemini 2.5 Flash's outputs were unique, implying that verbatim duplication rather than stable reasoning drove its high score (see the sketch after this list).
  • Safety-expression scores reached ceiling levels for all three models, so the metric could not differentiate between them on this task.
  • The authors conclude that model selection for LLM-based exercise-prescription systems is a clinical rather than purely technical decision, and that behavior under repeated generation should be a core criterion beyond single-output evaluations.
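The paper describes its protocol but does not release code; the sketch below is a minimal illustration of how repeated generation at temperature=0 and the unique-output proportion could be measured. The `generate_once` stub and all parameter names are placeholders, not the authors' implementation.

```python
from collections import Counter

def generate_once(model: str, prompt: str) -> str:
    """Placeholder for a single temperature=0 call to an LLM API
    (GPT-4.1, Claude Sonnet 4.6, or Gemini 2.5 Flash in the study)."""
    raise NotImplementedError

def repeated_generation(model: str, prompt: str, n: int = 20) -> dict:
    """Generate the same exercise-prescription prompt n times and
    summarise how often the model repeats itself verbatim."""
    outputs = [generate_once(model, prompt) for _ in range(n)]
    counts = Counter(outputs)
    return {
        "outputs": outputs,
        # 1.00 means every output is distinct (as reported for GPT-4.1);
        # 0.275 means heavy duplication (as reported for Gemini 2.5 Flash).
        "unique_fraction": len(counts) / n,
        "max_duplicates": counts.most_common(1)[0][1],
    }
```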

Abstract

This study compared repeated generation consistency of exercise prescription outputs across three large language models (LLMs), specifically GPT-4.1, Claude Sonnet 4.6, and Gemini 2.5 Flash, under temperature=0 conditions. Each model generated prescriptions for six clinical scenarios 20 times, yielding 360 total outputs analyzed across four dimensions: semantic similarity, output reproducibility, FITT classification, and safety expression. Mean semantic similarity was highest for GPT-4.1 (0.955), followed by Gemini 2.5 Flash (0.950) and Claude Sonnet 4.6 (0.903), with significant inter-model differences confirmed (H = 458.41, p < .001). Critically, these scores reflected fundamentally different generative behaviors: GPT-4.1 produced entirely unique outputs (100%) with stable semantic content, while Gemini 2.5 Flash showed pronounced output repetition (27.5% unique outputs), indicating that its high similarity score derived from text duplication rather than consistent reasoning. Identical decoding settings thus yielded sharply divergent consistency profiles, a distinction that single-output evaluations cannot capture. Safety expression reached ceiling levels across all models, confirming its limited utility as a differentiating metric. These results indicate that model selection constitutes a clinical rather than merely technical decision, and that output behavior under repeated generation conditions should be treated as a core criterion for reliable deployment of LLM-based exercise prescription systems.
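The abstract does not spell out the similarity pipeline; the following is a hedged sketch of how mean pairwise semantic similarity per model and a Kruskal-Wallis test across models could be computed. The choice of sentence encoder (`all-MiniLM-L6-v2`) and the data layout in `compare_models` are assumptions, not the authors' setup.

```python
from itertools import combinations

import numpy as np
from scipy.stats import kruskal
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Assumed encoder; the study does not state which embedding model was used.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def pairwise_similarities(outputs: list[str]) -> list[float]:
    """Cosine similarity of every pair among one model's repeated outputs."""
    embeddings = encoder.encode(outputs)
    sims = cosine_similarity(embeddings)
    return [float(sims[i, j]) for i, j in combinations(range(len(outputs)), 2)]

def compare_models(per_model: dict[str, list[str]]) -> None:
    """per_model maps a model name to its repeated outputs, e.g. 20 generations
    per scenario pooled across the six scenarios (hypothetical layout)."""
    groups = {name: pairwise_similarities(outs) for name, outs in per_model.items()}
    for name, sims in groups.items():
        print(f"{name}: mean semantic similarity = {np.mean(sims):.3f}")
    h, p = kruskal(*groups.values())  # non-parametric test of inter-model differences
    print(f"Kruskal-Wallis H = {h:.2f}, p = {p:.3g}")
```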