No Universal Courtesy: A Cross-Linguistic, Multi-Model Study of Politeness Effects on LLMs Using the PLUM Corpus
arXiv cs.CL / 4/20/2026
Key Points
- The study tests how LLMs respond to user prompts of varying politeness, using the PLUM corpus across three languages (English, Hindi, Spanish) and five models (Gemini-Pro, GPT-4o Mini, Claude 3.7 Sonnet, DeepSeek-Chat, Llama 3).
- Across 22,500 prompt–response pairs, the researchers evaluate responses over three interaction histories (raw, polite, impolite) and five politeness levels, using an eight-factor framework that covers quality and safety dimensions: coherence, clarity, depth, responsiveness, context retention, toxicity, conciseness, and readability.
- Polite prompts can improve average response quality by up to about 11%, while impolite tones degrade it, but these effects are not universal and vary substantially by language and model.
- The findings suggest language-specific best practices for tone: English favors courteous or direct wording, Hindi prefers deferential and indirect phrasing, and Spanish performs better with a more assertive tone.
- The paper releases PLUM, a publicly available multilingual dataset of human-validated prompts, to support reproducibility and future hypothesis testing on politeness theory for LLM behavior.
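The eight-factor evaluation described above can be sketched in code. The factor names come from the summary, but the 0–1 scoring scale, the helper functions, and the example ratings below are illustrative assumptions, not the paper's actual rubric or data:

```python
# Hypothetical sketch of the evaluation setup: score each response on an
# eight-factor rubric, then compare mean quality across politeness conditions.
# Factor names are from the summary; scales and scores are assumptions.
from statistics import mean

FACTORS = ["coherence", "clarity", "depth", "responsiveness",
           "context_retention", "toxicity", "conciseness", "readability"]

def quality_score(ratings: dict) -> float:
    """Average the eight factor ratings (each assumed on a 0-1 scale),
    inverting toxicity so that higher always means better."""
    adjusted = {f: (1 - v if f == "toxicity" else v) for f, v in ratings.items()}
    return mean(adjusted[f] for f in FACTORS)

def relative_change(polite_q: float, impolite_q: float) -> float:
    """Percent change in mean quality going from impolite to polite prompts."""
    return 100 * (polite_q - impolite_q) / impolite_q

# Illustrative ratings for one polite prompt-response pair; the impolite
# condition is modeled as a uniform quality drop plus a toxicity rise.
polite = {"coherence": 0.9, "clarity": 0.85, "depth": 0.8,
          "responsiveness": 0.9, "context_retention": 0.85,
          "toxicity": 0.05, "conciseness": 0.8, "readability": 0.9}
impolite = {f: (v + 0.08 if f == "toxicity" else v - 0.08)
            for f, v in polite.items()}

delta = relative_change(quality_score(polite), quality_score(impolite))
print(f"quality change: {delta:.1f}%")  # prints "quality change: 10.1%"
```

With these made-up numbers the polite condition scores roughly 10% higher, in the same ballpark as the paper's reported ~11% best case; the point of the sketch is only to show how a multi-factor rubric reduces to a single comparable quality number per condition.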