Investigating Counterfactual Unfairness in LLMs towards Identities through Humor

arXiv cs.CL / 4/22/2026


Key Points

  • The paper studies counterfactual unfairness in LLMs using humor as a lens on social assumptions learned from training data.
  • It proposes a framework covering three tasks—humor generation refusal, speaker intention inference, and relational/societal impact prediction—across both identity-agnostic and identity-disparaging humor.
  • The researchers introduce interpretable bias metrics to quantify asymmetric behavior when the identity of the speaker or addressee is swapped.
  • Experiments on state-of-the-art LLMs show consistent disparities: jokes attributed to privileged speakers are refused up to 67.5% more often, judged as malicious 64.7% more often, and rated up to 1.5 points higher in social harm on a 5-point scale.
  • The findings suggest that generative models can exhibit sensitivity and stereotyping at the same time, complicating efforts toward fairness and cultural alignment.

Abstract

Humor holds up a mirror to social perception: what we find funny often reflects who we are and how we judge others. When language models engage with humor, their reactions expose the social assumptions they have internalized from training data. In this paper, we investigate counterfactual unfairness through humor by observing how the model's responses change when we swap who speaks and who is addressed while holding other factors constant. Our framework spans three tasks: humor generation refusal, speaker intention inference, and relational/societal impact prediction, covering both identity-agnostic humor and identity-specific disparagement humor. We introduce interpretable bias metrics that capture asymmetric patterns under identity swaps. Experiments across state-of-the-art models reveal consistent relational disparities: jokes told by privileged speakers are refused up to 67.5% more often, judged as malicious 64.7% more frequently, and rated up to 1.5 points higher in social harm on a 5-point scale. These patterns highlight how sensitivity and stereotyping coexist in generative models, complicating efforts toward fairness and cultural alignment.
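The paper's exact metric definitions are not reproduced in this summary, but the core idea of an interpretable swap-based bias metric can be illustrated. The sketch below (hypothetical names and toy data, not the authors' implementation) computes a refusal-rate gap: the change in how often a model refuses when only the speaker's identity in otherwise-identical humor prompts is swapped.

```python
def refusal_rate(responses):
    """Fraction of model responses labeled as refusals."""
    if not responses:
        return 0.0
    return sum(1 for r in responses if r == "refuse") / len(responses)

def counterfactual_refusal_gap(responses_by_identity, identity_a, identity_b):
    """Absolute difference in refusal rates between two speaker identities.

    Each identity maps to responses collected on prompts that are identical
    except for the swapped speaker identity, so any gap reflects
    counterfactual (identity-conditioned) asymmetry rather than prompt content.
    """
    rate_a = refusal_rate(responses_by_identity[identity_a])
    rate_b = refusal_rate(responses_by_identity[identity_b])
    return abs(rate_a - rate_b)

# Toy example: the same four joke prompts, attributed to two different speakers.
responses = {
    "speaker_A": ["refuse", "refuse", "comply", "refuse"],  # 0.75 refusal rate
    "speaker_B": ["comply", "refuse", "comply", "comply"],  # 0.25 refusal rate
}
gap = counterfactual_refusal_gap(responses, "speaker_A", "speaker_B")
print(gap)  # → 0.5
```

Analogous gap metrics can be defined for the paper's other two tasks by replacing the binary refusal label with maliciousness judgments or 5-point social-harm ratings and comparing mean scores across identity swaps.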