Surrogate modeling for interpreting black-box LLMs in medical predictions

arXiv cs.CL · April 23, 2026

📰 News · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper proposes a surrogate modeling framework to quantitatively interpret what black-box LLMs encode, especially in a medical-prediction setting.
  • For a hypothesis derived from domain knowledge, it approximates the LLM’s latent knowledge space using observable input-output behavior, collected via extensive prompting across many simulated scenarios.
  • Proof-of-concept experiments show how the framework can measure the extent to which an LLM “perceives” each input variable relative to the output.
  • The study reveals that LLM-encoded knowledge can include associations that contradict established medical knowledge and may retain scientifically refuted racial assumptions from training data.
  • The authors position the framework as a red-flag indicator to improve safe and reliable deployment by surfacing potentially inaccurate or biased model behavior early.

Abstract

Large language models (LLMs), trained on vast datasets, encode extensive real-world knowledge within their parameters, yet their black-box nature obscures the mechanisms and extent of this encoding. Surrogate modeling, which uses simplified models to approximate complex systems, can offer a path toward better interpretability of black-box models. We propose a surrogate modeling framework that quantitatively explains LLM-encoded knowledge. For a specific hypothesis derived from domain knowledge, this framework approximates the latent LLM knowledge space using observable elements (input-output pairs) through extensive prompting across a comprehensive range of simulated scenarios. Through proof-of-concept experiments in medical predictions, we demonstrate our framework's effectiveness in revealing the extent to which LLMs "perceive" each input variable in relation to the output. In particular, given concerns that LLMs may perpetuate inaccuracies and societal biases embedded in their training data, our experiments using this framework quantitatively revealed both associations that contradict established medical knowledge and the persistence of scientifically refuted racial assumptions within LLM-encoded knowledge. By disclosing these issues, our framework can act as a red-flag indicator to support the safe and reliable application of these models.
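The pipeline the abstract describes — generate simulated scenarios, collect the black-box model's input-output pairs, then fit an interpretable surrogate to quantify how strongly each input variable drives the output — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `mock_llm_risk` function, the three input variables, and the linear surrogate are all assumptions chosen for the sketch (a real study would prompt an actual LLM and could use a richer surrogate class).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the black box. In the paper's setting this would
# be an LLM prompted with each simulated patient scenario and asked for a
# risk score; here we assume a fixed sigmoid response for illustration.
def mock_llm_risk(age, bmi, smoker):
    logit = 0.04 * (age - 50) + 0.1 * (bmi - 25) + 0.8 * smoker
    return 1.0 / (1.0 + np.exp(-logit))

# 1. Generate a comprehensive range of simulated scenarios.
n = 500
age = rng.uniform(20, 90, n)
bmi = rng.uniform(18, 40, n)
smoker = rng.integers(0, 2, n).astype(float)

# 2. Collect observable input-output pairs by querying the black box.
y = mock_llm_risk(age, bmi, smoker)

# 3. Fit an interpretable surrogate: a linear model on standardized inputs,
#    solved by ordinary least squares.
X = np.column_stack([age, bmi, smoker])
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
A = np.column_stack([np.ones(n), Xs])       # add intercept column
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# 4. The standardized surrogate coefficients approximate how strongly the
#    model "perceives" each input variable relative to the output.
for name, c in zip(["age", "bmi", "smoker"], coef[1:]):
    print(f"{name}: {c:+.3f}")
```

Inspecting the fitted coefficients is where a red flag would surface: a variable (e.g. race) with a large surrogate coefficient despite having no established clinical relationship to the outcome would indicate an encoded association worth auditing.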