Compared to What? Baselines and Metrics for Counterfactual Prompting

arXiv cs.CL / 5/5/2026


Key Points

  • The paper argues that effects observed under counterfactual prompting can be misattributed to the targeted factor, because each edit mixes the intended change with incidental surface-form variation, violating assumptions needed for causal attribution.
  • Using MedQA, it shows that prediction flip rates from changing patient gender (14.9%) are statistically indistinguishable from those caused by paraphrasing (14.1%), undermining claims of special sensitivity to gender.
  • The authors propose a statistical framework that benchmarks targeted interventions against effects from meaning-preserving paraphrases to robustly isolate the causal impact of the targeted factor (a minimal sketch of this comparison follows this list).
  • Reanalyzing MedPerturb, they find previously reported sensitivity to demographics and stylistic cues largely disappears, with only 5 out of 120 tests remaining significant; however, applying the framework to occupational biography classification reveals significant directional gender bias.
  • Across multiple evaluation metrics, the study finds per-sample distributional metrics outperform aggregate metrics, while regression metrics uniquely characterize the direction and magnitude of effects.
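
To make the baseline comparison concrete, here is a minimal Python sketch, not the authors' code: it contrasts the flip rate of a targeted edit with a paraphrase-only baseline via an unpaired two-proportion z-test. The function names, the use of scipy, and the choice of test are assumptions for illustration; the paper's exact statistical procedure is not specified in this summary, and a paired test would better respect the fact that both conditions share the same original inputs.

```python
import numpy as np
from scipy import stats

def flip_rate(orig_preds, edited_preds):
    """Fraction of samples whose prediction changes under an edit."""
    return np.mean(np.asarray(orig_preds) != np.asarray(edited_preds))

def compare_to_paraphrase_baseline(orig, targeted, paraphrased):
    """Unpaired two-proportion z-test on flip rates (illustrative only).

    orig:        predictions on unedited inputs
    targeted:    predictions after the targeted edit (e.g. a gender swap)
    paraphrased: predictions after a meaning-preserving paraphrase
    """
    n = len(orig)
    p_t = flip_rate(orig, targeted)     # e.g. ~0.149 for gender edits on MedQA
    p_b = flip_rate(orig, paraphrased)  # e.g. ~0.141 for paraphrases
    p_pool = (p_t + p_b) / 2            # pooled rate (equal sample sizes)
    se = np.sqrt(p_pool * (1 - p_pool) * (2 / n))
    z = (p_t - p_b) / se if se > 0 else 0.0
    p_value = 2 * stats.norm.sf(abs(z))  # two-sided
    return p_t, p_b, p_value
```

If the p-value is large, as the paper reports for gender edits versus paraphrases on MedQA, the targeted effect is indistinguishable from general input sensitivity.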

Abstract

Counterfactual prompting (i.e., perturbing a single factor and measuring output change) is widely used to evaluate phenomena such as LLM bias and chain-of-thought (CoT) faithfulness. But in this work we argue that observed effects cannot be attributed to the targeted factor without accounting for baseline "meaning-preserving" modifications to text that establish general model sensitivity. This is because every counterfactual edit is a compound treatment that bundles the variable of interest with incidental surface-form variation; this violates treatment variation irrelevance. On MedQA, we observe prediction flip rates of 14.9% when we surgically change patient gender. However, this is statistically indistinguishable from the flip rates induced by simply paraphrasing inputs (14.1%). In this case, it would therefore be unwarranted to conclude that the LLM is especially sensitive to patient gender. To account for this and robustly measure the effects of targeted interventions, we propose a framework in which we compare (via statistical testing) differences observed under target interventions to those induced by paraphrasing inputs. We then use this framework to revisit an analysis of the MedPerturb dataset, which reported evidence of model sensitivity to patient demographics and stylistic cues. We find that these effects largely dissipate when we account for general model sensitivity, with only 5 of 120 tests reaching statistical significance. Applying the same framework to occupational biography classification, we detect clearly significant directional gender bias, showing that the framework identifies real directional effects even when they are small. We evaluate a range of metrics (aggregate, per-sample distributional, and regression) and find that per-sample metrics are dramatically more powerful than aggregate metrics, and that regression uniquely characterizes the direction and magnitude of effects.
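
As a hedged illustration of the abstract's last point, the sketch below fits a logistic regression of binary model predictions on an edit-type indicator, so the sign of the fitted coefficient gives the direction of the shift and its magnitude the size in log-odds, which flip rates alone cannot convey. The use of statsmodels, the helper name, and the data layout are assumptions; the paper's actual regression specification is not given in this summary.

```python
import numpy as np
import statsmodels.api as sm

def directional_effect(preds, is_target_edit):
    """Logistic regression of predictions on an edit-type flag.

    preds:          0/1 model predictions on edited inputs
    is_target_edit: 1 for targeted edits (e.g. a gender swap),
                    0 for meaning-preserving paraphrases
    Returns (coef, p_value): the sign of coef gives the direction of
    the effect, its magnitude the size in log-odds.
    """
    X = sm.add_constant(np.asarray(is_target_edit, dtype=float))
    fit = sm.Logit(np.asarray(preds, dtype=float), X).fit(disp=0)
    return fit.params[1], fit.pvalues[1]

# Hypothetical usage with made-up predictions:
# coef, p = directional_effect(preds=[1, 0, 1, 1, 0, 0],
#                              is_target_edit=[1, 1, 1, 0, 0, 0])
```

A significant positive or negative coefficient would indicate a directional bias of the kind the paper reports for occupational biography classification, whereas a flip rate would only show that predictions changed, not which way.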