MSA-Thinker: Discrimination-Calibration Reasoning with Hint-Guided Reinforcement Learning for Multimodal Sentiment Analysis

arXiv cs.AI / 4/2/2026


Key Points

  • The paper targets the interpretability and robustness of multimodal large language models (MLLMs) in multimodal sentiment analysis, where they are usually trained and used as end-to-end “black boxes.”
  • It introduces structured Discrimination-Calibration (DC) reasoning and pairs it with hint-guided reinforcement learning to tackle RL’s low exploration efficiency and sparse rewards on hard samples.
  • The method begins with a cold-start supervised fine-tuning stage using high-quality chain-of-thought data synthesized by a teacher model (Qwen3Omni-30B), embedding the DC reasoning structure from the outset.
  • It then proposes “Hint-GRPO,” using the discrimination stage as a verifiable anchor to provide directional hints during RL, improving learning efficiency and reducing reward sparsity.
  • Experiments on Qwen2.5Omni-7B show higher accuracy for fine-grained sentiment regression, high-quality structured reasoning chains, and better cross-domain generalization.
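
To make the Discrimination-Calibration structure in the points above concrete, here is a minimal sketch of how a two-stage output could be parsed and its coarse stage checked against a label. The tag names (`<discrimination>`, `<calibration>`), the three polarity classes, and the use of the sign of a real-valued gold score are all assumptions made for illustration; the summary does not specify the paper's actual output format or scoring range.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical output format: tag names and polarity classes are assumptions
# for illustration, not the paper's actual template.
DC_PATTERN = re.compile(
    r"<discrimination>\s*(positive|negative|neutral)\s*</discrimination>\s*"
    r"<calibration>\s*([-+]?\d+(?:\.\d+)?)\s*</calibration>",
    re.IGNORECASE,
)


@dataclass
class DCOutput:
    polarity: str  # coarse label from the discrimination stage
    score: float   # fine-grained value from the calibration stage


def parse_dc(text: str) -> Optional[DCOutput]:
    """Extract both stages from a response, or None if the format is broken."""
    match = DC_PATTERN.search(text)
    if match is None:
        return None
    return DCOutput(polarity=match.group(1).lower(), score=float(match.group(2)))


def polarity_of(score: float) -> str:
    """Map a real-valued sentiment score to a coarse polarity by its sign."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def discrimination_correct(out: DCOutput, gold_score: float) -> bool:
    """The coarse stage is verifiable: compare it with the gold score's sign."""
    return out.polarity == polarity_of(gold_score)


def is_self_consistent(out: DCOutput) -> bool:
    """Check that the calibrated score agrees with the model's own coarse label."""
    return out.polarity == polarity_of(out.score)
```

Because the discrimination stage yields a discrete label, it can be checked exactly even when the regression value is far off, which is one plausible reading of why it can serve as a "verifiable anchor."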

Abstract

Multimodal sentiment analysis aims to understand human emotions by integrating textual, auditory, and visual modalities. Although Multimodal Large Language Models (MLLMs) have achieved state-of-the-art performance via supervised fine-tuning (SFT), their end-to-end "black-box" nature limits interpretability. Existing methods incorporating Chain-of-Thought (CoT) reasoning are hindered by high annotation costs, while Reinforcement Learning (RL) faces challenges such as low exploration efficiency and sparse rewards, particularly on hard samples. To address these issues, we propose a novel training framework that integrates structured Discrimination-Calibration (DC) reasoning with Hint-based Reinforcement Learning. First, we perform cold-start SFT using high-quality CoT data synthesized by a teacher model (Qwen3Omni-30B), which inherently contains the DC structure. This equips the model with a reasoning paradigm that performs macro discrimination followed by fine-grained calibration from the initial stage. Building on this, we propose Hint-GRPO, which leverages the discrimination phase within the DC structure as a verifiable anchor during RL to provide directional hints for hard samples, guiding policy optimization and effectively mitigating the reward sparsity problem. Experiments on the Qwen2.5Omni-7B model demonstrate that our method not only achieves higher accuracy in fine-grained sentiment regression tasks but also generates high-quality structured reasoning chains. Crucially, it exhibits superior generalization capability in cross-domain evaluations. This enhances model interpretability while validating the positive contribution of explicit reasoning steps to model robustness, offering a new paradigm for building trustworthy and efficient sentiment analysis systems.
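
The reward sparsity problem and the role of hints can be sketched as follows. In GRPO-style training, each response's advantage is its reward normalized within a group of sampled responses, so a hard sample on which every rollout earns the same reward yields all-zero advantages and no learning signal. The sketch below assumes, purely for illustration, that a hint reveals only the coarse polarity of the gold label and is added when no rollout in the group earns any reward. The functions `make_hint` and `rollout_with_hint`, the `sample_group` and `reward_fn` callables, and the trigger condition are hypothetical and are not the paper's implementation.

```python
from statistics import mean, pstdev
from typing import Callable, List, Tuple


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantages: (r - group mean) / (group std + eps).

    Uses the population standard deviation; this choice is an assumption.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def make_hint(gold_score: float) -> str:
    """Hypothetical hint: reveals the coarse polarity only, never the score."""
    if gold_score > 0:
        polarity = "positive"
    elif gold_score < 0:
        polarity = "negative"
    else:
        polarity = "neutral"
    return f"Hint: the overall sentiment is {polarity}."


def rollout_with_hint(
    prompt: str,
    gold_score: float,
    sample_group: Callable[[str], List[str]],  # assumed policy sampler
    reward_fn: Callable[[str, float], float],  # assumed reward function
) -> Tuple[List[str], List[float], bool]:
    """Sample a group; if no rollout earns reward, resample with a hint."""
    responses = sample_group(prompt)
    rewards = [reward_fn(r, gold_score) for r in responses]
    used_hint = False
    if max(rewards) <= 0.0:  # hard sample: the whole group failed
        hinted_prompt = prompt + "\n" + make_hint(gold_score)
        responses = sample_group(hinted_prompt)
        rewards = [reward_fn(r, gold_score) for r in responses]
        used_hint = True
    return responses, group_advantages(rewards), used_hint
```

For example, `group_advantages([0.0, 0.0, 0.0, 0.0])` returns all zeros, which is the sparse-reward case; once a hint lets at least one rollout succeed, the group has nonzero variance and the advantages become informative.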
