Probing Multimodal Large Language Models on Cognitive Biases in Chinese Short-Video Misinformation

arXiv cs.CL / 5/4/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper presents a multimodal evaluation framework to test how Multimodal Large Language Models handle Chinese short-video misinformation that is intertwined with cognitive biases.
  • It introduces a manually annotated dataset of 200 short videos across four health domains, with evidence-verified labels for three deceptive patterns: experimental errors, logical fallacies, and fabricated claims.
  • Eight frontier MLLMs are evaluated under five modality settings, with Gemini-2.5-Pro achieving the best multimodal performance (belief score 71.5/100) and o3 the lowest (35.2).
  • The study analyzes social cues in misinformation videos and finds models can form false beliefs due to biases such as authoritative channel IDs.

Abstract

Short-video platforms have become major channels for misinformation, where deceptive claims frequently leverage visual experiments and social cues. While Multimodal Large Language Models (MLLMs) have demonstrated impressive reasoning capabilities, their robustness against misinformation entangled with cognitive biases remains under-explored. In this paper, we introduce a comprehensive evaluation framework using a high-quality, manually annotated dataset of 200 short videos spanning four health domains. This dataset provides fine-grained annotations for three deceptive patterns (experimental errors, logical fallacies, and fabricated claims), each verified by evidence such as national standards and academic literature. We evaluate eight frontier MLLMs across five modality settings. Experimental results demonstrate that Gemini-2.5-Pro achieves the highest performance in the multimodal setting with a belief score of 71.5/100, while o3 performs the worst at 35.2. Furthermore, we investigate social cues that induce false beliefs in videos and find that models are susceptible to biases like authoritative channel IDs.