Compared to What? Baselines and Metrics for Counterfactual Prompting
arXiv cs.CL / 5/5/2026
Key Points
- The paper argues that counterfactual prompting results can be misattributed because edits mix the intended factor change with incidental surface-form variation, violating assumptions needed for causal attribution.
- Using MedQA, it shows that prediction flip rates from changing patient gender (14.9%) are statistically indistinguishable from those caused by paraphrasing (14.1%), undermining claims of special sensitivity to gender.
- The authors propose a statistical framework that benchmarks targeted interventions against the effect of meaning-preserving paraphrases, so that the causal impact of the targeted factor can be isolated from incidental surface-form variation (a sketch of this comparison follows the list).
- Reanalyzing MedPerturb, they find previously reported sensitivity to demographics and stylistic cues largely disappears, with only 5 out of 120 tests remaining significant; however, applying the framework to occupational biography classification reveals significant directional gender bias.
- Comparing evaluation metrics, the study finds that per-sample distributional metrics outperform aggregate metrics, while regression metrics are unique in characterizing both the direction and the magnitude of effects (see the regression sketch below).
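
To make the baseline idea concrete, here is a minimal sketch of how a targeted edit could be benchmarked against a paraphrase baseline, assuming per-sample binary flip indicators for a gender-swap edit and for meaning-preserving paraphrases. The `permutation_test` helper, the array names, and the toy data are illustrative assumptions, not the paper's code or data.

```python
import numpy as np

rng = np.random.default_rng(0)


def flip_rate_gap(targeted_flips, paraphrase_flips):
    """Difference between the targeted-edit flip rate and the paraphrase-baseline flip rate."""
    return targeted_flips.mean() - paraphrase_flips.mean()


def permutation_test(targeted_flips, paraphrase_flips, n_perm=10_000):
    """Two-sided permutation test: is the targeted flip rate distinguishable
    from the flip rate produced by meaning-preserving paraphrases alone?"""
    observed = flip_rate_gap(targeted_flips, paraphrase_flips)
    pooled = np.concatenate([targeted_flips, paraphrase_flips])
    n = len(targeted_flips)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        gap = pooled[:n].mean() - pooled[n:].mean()
        if abs(gap) >= abs(observed):
            extreme += 1
    return observed, extreme / n_perm


# Toy data, not the paper's: 1 = the model's answer flipped under the edit, 0 = unchanged.
targeted = rng.binomial(1, 0.149, size=500)    # e.g. gender-swapped prompts
paraphrase = rng.binomial(1, 0.141, size=500)  # meaning-preserving paraphrases

gap, p_value = permutation_test(targeted, paraphrase)
print(f"flip-rate gap = {gap:+.3f}, permutation p = {p_value:.3f}")
```

A large p-value here would mean the targeted edit flips predictions no more often than paraphrasing does, which is the pattern the paper reports for gender on MedQA.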
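A regression metric of the kind the last point describes could be sketched as a logistic regression of the model's prediction on an edit indicator, whose coefficient sign gives the direction of the effect and whose size gives its magnitude. The variable names and simulated data below are hypothetical; the paper's actual regression specification may differ.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

# Toy paired setup (illustrative only): each biography is classified once as written
# and once after a gender-swap edit; 'edited' marks the swapped version.
n = 400
edited = np.repeat([0.0, 1.0], n)
# Simulate a directional effect: swapped versions are labelled "positive" less often.
prediction = rng.binomial(1, np.where(edited == 1.0, 0.42, 0.50))

# Logistic regression of the prediction on the edit indicator: the coefficient's
# sign gives the direction of the effect and its size the magnitude (in log-odds).
design = sm.add_constant(edited)
result = sm.Logit(prediction, design).fit(disp=False)
print(result.params)   # [intercept, edit coefficient]
print(result.pvalues)  # significance of the edit coefficient
```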