What is the scientific value of administering the standard Rorschach test to LLMs when the training data is almost certainly contaminated? (R) + [D]

Reddit r/MachineLearning / 4/29/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • A paper in JMIR Mental Health tested three multimodal LLMs (GPT-4o, Grok 3, Gemini 2.0) using the standard 10 Rorschach inkblot cards and coded outputs with the Exner Comprehensive System to analyze “perceptual styles” and human-related themes.
  • The author of this discussion questions the study’s scientific validity, arguing the stimuli and scoring materials are widely available online and likely contaminated in model training, which undermines inferences about genuine perception.
  • The critique suggests the models may mainly be retrieving likely associations and performing pattern matching/text completion based on memorized psychometric content, rather than processing visual ambiguity.
  • Concerns are also raised about weak experimental controls—using public web interfaces with default settings and very small sample sizes—along with admissions that the models may have encountered the relevant concepts during training.
  • The post asks how such methodological issues could pass peer review and what meaningful conclusions studies like this can draw about how AI handles ambiguous images.

A recent paper published in JMIR Mental Health (Csigó & Cserey, 2026) caught my attention. The researchers administered the 10 standard Rorschach inkblot cards to three multimodal LLMs (GPT-4o, Grok 3, Gemini 2.0) and coded their responses using the Exner Comprehensive System. They analyzed the models' "perceptual styles," determinants (like human movement vs. color), and human-related content themes.

However, I am seriously struggling to see the methodological validity of this setup, and I'm curious what the scientific community thinks. My main concerns are:

  • Massive Data Contamination: The 10 standard Rorschach cards, along with decades of psychological literature, scoring manuals (like the Exner system), and typical human responses, are widely available on the internet. It is highly probable that this material is already baked into the models' training data, and hence into their weights.
  • Testing Retrieval, Not Perception: Because they used the standard, century-old inkblots instead of novel, AI-generated, or strictly controlled ambiguous images, aren't they just testing the models' ability to retrieve the most statistically probable lexical associations for those specific images from their training data?
  • Lack of Controls: As far as I can tell from the paper, the researchers used the public web interfaces with default settings (no API, no temperature control) and apparently ran the test only once per model, yielding a tiny sample size. A rough sketch of what a more controlled protocol could look like follows this list.
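To make the controls point concrete, here is a minimal sketch (mine, not the paper's, and purely illustrative) of what a more controlled run could look like using the OpenAI Python SDK: the temperature is pinned rather than left at web-UI defaults, each card is sampled repeatedly so response variability can actually be measured, and a text-only probe checks whether the model can recite Rorschach "popular" responses without seeing any image at all. The model identifier, card file paths, prompt wording, and trial count are all placeholders.

```python
# Sketch only: pinned temperature, repeated trials per card, plus a text-only
# contamination probe. Paths, model name, and prompts are hypothetical.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CARD_PATHS = [f"cards/card_{i:02d}.png" for i in range(1, 11)]  # hypothetical local scans
N_TRIALS = 20  # repeated samples per card instead of a single run


def encode_image(path: str) -> str:
    """Read an image file and return it as a base64 data URL."""
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()


def ask_about_card(image_path: str, temperature: float = 0.7) -> str:
    """Send one inkblot image with a neutral prompt and return the reply text."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model identifier
        temperature=temperature,  # pinned, unlike default web-UI settings
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "What might this be?"},
                {"type": "image_url", "image_url": {"url": encode_image(image_path)}},
            ],
        }],
    )
    return response.choices[0].message.content


def contamination_probe() -> str:
    """Text-only probe: if the model can list 'popular' responses without seeing
    any image, the stimuli and norms are plainly present in its training data."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[{
            "role": "user",
            "content": "List the most common ('popular') responses to Rorschach Card V.",
        }],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    print(contamination_probe())
    for path in CARD_PATHS:
        samples = [ask_about_card(path) for _ in range(N_TRIALS)]
        # samples can now be coded (e.g., under Exner) and compared across runs
```

Even something this simple would give you on the order of 200 codeable responses per model instead of 10, let you quantify run-to-run variance, and give you a direct memorization check to report alongside the image results.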
Ironically, the authors explicitly admit in their "Limitations" section that the models likely encountered the stimuli and the scoring concepts during training, which could influence the outputs independently of any genuine image understanding.

So, methodologically, what is the actual scientific value of running projective psychological tests on LLMs without using novel stimuli to at least try to rule out data contamination? Given how LLMs actually work, does a study like this tell us anything meaningful about how AI processes visual ambiguity, or is it merely demonstrating advanced pattern matching and text completion over widely known psychometric material? And how do studies with such glaring loopholes around LLM training-data contamination make it through peer review at decent journals?

Maybe I'm being a bit harsh here; I mainly wanted to be a little provocative. Here is the study: https://mental.jmir.org/2026/1/e88186

submitted by /u/Impossible_Echo4029