Cross-Model Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Across Three Large Language Models

arXiv cs.CL / 4/22/2026


Key Points

  • The paper evaluates cross-model consistency for AI-generated exercise prescriptions by repeatedly generating outputs (20 times each) for six clinical scenarios using GPT-4.1, Claude Sonnet 4.6, and Gemini 2.5 Flash under temperature=0.
  • GPT-4.1 achieved the highest mean semantic similarity (0.955), followed by Gemini 2.5 Flash (0.950) and Claude Sonnet 4.6 (0.903); the differences across models were statistically significant.
  • Despite comparable similarity scores, the models behaved in fundamentally different ways: GPT-4.1 produced 100% unique outputs, while only 27.5% of Gemini 2.5 Flash's outputs were unique, implying that verbatim duplication rather than stable reasoning drove its high score (see the sketch after this list).
  • Safety-expression scores reached ceiling levels for all three models, so the metric could not differentiate between them on this task.
  • The authors conclude that model selection for LLM-based exercise-prescription systems is a clinical rather than purely technical decision, and that behavior under repeated generation should be a core criterion beyond single-output evaluations.
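The paper describes its protocol but does not release code; the sketch below is a minimal illustration of how repeated generation at temperature=0 and the unique-output proportion could be measured. The `generate_once` stub and all parameter names are placeholders, not the authors' implementation.

```python
from collections import Counter

def generate_once(model: str, prompt: str) -> str:
    """Placeholder for a single temperature=0 call to an LLM API
    (GPT-4.1, Claude Sonnet 4.6, or Gemini 2.5 Flash in the study)."""
    raise NotImplementedError

def repeated_generation(model: str, prompt: str, n: int = 20) -> dict:
    """Generate the same exercise-prescription prompt n times and
    summarise how often the model repeats itself verbatim."""
    outputs = [generate_once(model, prompt) for _ in range(n)]
    counts = Counter(outputs)
    return {
        "outputs": outputs,
        # 1.00 means every output is distinct (as reported for GPT-4.1);
        # 0.275 means heavy duplication (as reported for Gemini 2.5 Flash).
        "unique_fraction": len(counts) / n,
        "max_duplicates": counts.most_common(1)[0][1],
    }
```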

Abstract

This study compared repeated generation consistency of exercise prescription outputs across three large language models (LLMs), specifically GPT-4.1, Claude Sonnet 4.6, and Gemini 2.5 Flash, under temperature=0 conditions. Each model generated prescriptions for six clinical scenarios 20 times, yielding 360 total outputs analyzed across four dimensions: semantic similarity, output reproducibility, FITT classification, and safety expression. Mean semantic similarity was highest for GPT-4.1 (0.955), followed by Gemini 2.5 Flash (0.950) and Claude Sonnet 4.6 (0.903), with significant inter-model differences confirmed (H = 458.41, p < .001). Critically, these scores reflected fundamentally different generative behaviors: GPT-4.1 produced entirely unique outputs (100%) with stable semantic content, while Gemini 2.5 Flash showed pronounced output repetition (27.5% unique outputs), indicating that its high similarity score derived from text duplication rather than consistent reasoning. Identical decoding settings thus yielded sharply divergent consistency profiles, a distinction that single-output evaluations cannot capture. Safety expression reached ceiling levels across all models, confirming its limited utility as a differentiating metric. These results indicate that model selection constitutes a clinical rather than merely technical decision, and that output behavior under repeated generation conditions should be treated as a core criterion for reliable deployment of LLM-based exercise prescription systems.
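The abstract does not spell out the similarity pipeline; the following is a hedged sketch of how mean pairwise semantic similarity per model and a Kruskal-Wallis test across models could be computed. The choice of sentence encoder (`all-MiniLM-L6-v2`) and the data layout in `compare_models` are assumptions, not the authors' setup.

```python
from itertools import combinations

import numpy as np
from scipy.stats import kruskal
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Assumed encoder; the study does not state which embedding model was used.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def pairwise_similarities(outputs: list[str]) -> list[float]:
    """Cosine similarity of every pair among one model's repeated outputs."""
    embeddings = encoder.encode(outputs)
    sims = cosine_similarity(embeddings)
    return [float(sims[i, j]) for i, j in combinations(range(len(outputs)), 2)]

def compare_models(per_model: dict[str, list[str]]) -> None:
    """per_model maps a model name to its repeated outputs, e.g. 20 generations
    per scenario pooled across the six scenarios (hypothetical layout)."""
    groups = {name: pairwise_similarities(outs) for name, outs in per_model.items()}
    for name, sims in groups.items():
        print(f"{name}: mean semantic similarity = {np.mean(sims):.3f}")
    h, p = kruskal(*groups.values())  # non-parametric test of inter-model differences
    print(f"Kruskal-Wallis H = {h:.2f}, p = {p:.3g}")
```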