Are LLMs More Skeptical of Entertainment News?

arXiv cs.AI / 5/5/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The study investigates whether zero-shot LLMs apply different credibility standards across journalistic genres, specifically entertainment news versus hard news.
  • Using a within-dataset design on GossipCop from FakeNewsNet, two of four frontier models (DeepSeek-V3.2 and GPT-5.2) show significantly higher false-positive rates for legitimate entertainment news, with gaps of 10.1 and 8.8 percentage points respectively.
  • Two other models (Claude Opus 4.6 and Gemini 3 Flash) do not exhibit comparable genre asymmetry, indicating the effect is model-dependent.
  • Style-swap experiments produce only limited, inconsistent changes, and prompt-based mitigation is not generic: framing the model as an entertainment-news fact-checker cuts DeepSeek-V3.2's false positives by about 50% with no measurable recall loss, but provides little benefit for GPT-5.2.
  • Qualitative analysis suggests recurring error patterns, including treating private-life claims as inherently unverifiable and dismissing entertainment journalism as an epistemically weaker genre, implying that aggregate accuracy metrics can hide structured false positives.
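
The core measurement behind these key points is a genre-stratified false-positive rate with a significance test on the gap. The summary does not state which test the authors used, so the sketch below assumes a standard two-proportion z-test, implemented with only the Python standard library:

```python
import math

def false_positive_rate(labels, preds):
    """FPR = fraction of legitimate items (label 0) flagged as fake (pred 1)."""
    real_preds = [p for y, p in zip(labels, preds) if y == 0]
    return sum(real_preds) / len(real_preds)

def fpr_gap_z_test(labels_a, preds_a, labels_b, preds_b):
    """Compare FPRs of two genre subsets (e.g. entertainment vs. hard news).

    Returns (fpr_a - fpr_b, two-sided p-value) under a pooled
    two-proportion z-test -- an assumed choice, not necessarily
    the paper's exact procedure.
    """
    def fp_counts(labels, preds):
        real_preds = [p for y, p in zip(labels, preds) if y == 0]
        return sum(real_preds), len(real_preds)

    x_a, n_a = fp_counts(labels_a, preds_a)
    x_b, n_b = fp_counts(labels_b, preds_b)
    p_a, p_b = x_a / n_a, x_b / n_b
    pooled = (x_a + x_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided normal p-value: P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_a - p_b, p_value
```

With labels and predictions split by genre, a reported gap such as DeepSeek-V3.2's 10.1 percentage points corresponds to `fpr_gap_z_test` returning a gap of about 0.101 with p < .001 at the dataset's sample sizes.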

Abstract

Large language models (LLMs) are increasingly used for automated news credibility assessment, yet it remains unclear whether they apply even-handed standards across journalistic genres. We examine whether zero-shot LLMs are more likely to misclassify legitimate entertainment news as fake than legitimate hard news, using a within-dataset design on GossipCop from FakeNewsNet. Across four frontier models, we find a clear but model-specific genre asymmetry: DeepSeek-V3.2 and GPT-5.2 show false-positive-rate gaps of 10.1 and 8.8 percentage points, respectively (both p < .001), whereas Claude Opus 4.6 and Gemini 3 Flash show no comparable difference. A style-swap experiment yields only limited and inconsistent changes, suggesting that the asymmetry is not reducible to stylistic register alone. Prompt-based mitigation is likewise possible but not generic: framing the model as an entertainment-news fact-checker reduces false positives for DeepSeek-V3.2 by about 50% without detectable recall loss, but offers little improvement for GPT-5.2. Exploratory qualitative coding further suggests two recurring error patterns in sampled false positives: treating private-life claims as inherently unverifiable and discounting entertainment journalism as an epistemically weaker genre. Taken together, these findings show that aggregate performance metrics can obscure structured false positives within legitimate journalism. We argue that LLM-based credibility assessment may not only evaluate truth claims but also differentially recognize the legitimacy of journalistic genres, and that evaluation should therefore include genre-stratified false-positive analysis alongside overall accuracy.
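
The "entertainment-news fact-checker" framing described in the abstract is a prompt-level intervention. The exact wording the authors used is not given here, so the following is a hypothetical sketch of what such persona framing might look like; `BASELINE_PROMPT`, `GENRE_FRAMED_PROMPT`, and `build_prompt` are illustrative names, not the paper's:

```python
# Hypothetical prompt templates for genre-framed credibility assessment.
# The wording is an assumption for illustration; the paper's actual
# prompts are not reproduced in this summary.
BASELINE_PROMPT = (
    "You are a fact-checker. Decide whether the following news article is "
    "real or fake. Answer with exactly one word: REAL or FAKE.\n\n{article}"
)

GENRE_FRAMED_PROMPT = (
    "You are an experienced entertainment-news fact-checker. Entertainment "
    "reporting is a legitimate journalistic genre, and private-life claims "
    "are often verifiable through publicists, court records, and "
    "on-the-record interviews. Decide whether the following news article is "
    "real or fake. Answer with exactly one word: REAL or FAKE.\n\n{article}"
)

def build_prompt(article: str, genre_framed: bool = False) -> str:
    """Fill the chosen template with the article text."""
    template = GENRE_FRAMED_PROMPT if genre_framed else BASELINE_PROMPT
    return template.format(article=article)
```

The paper's finding that this framing helps DeepSeek-V3.2 but barely helps GPT-5.2 suggests such mitigations need to be validated per model rather than assumed transferable.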