Injecting Distributional Awareness into MLLMs via Reinforcement Learning for Deep Imbalanced Regression

arXiv cs.CL / 5/5/2026


Key Points

  • Multimodal large language models (MLLMs) often perform poorly on numerical regression with long-tailed (imbalanced) target distributions, tending to regress toward the mean because token-level supervised fine-tuning biases learning toward high-density regions of the target distribution.
  • The paper identifies a key gap in existing training: insufficient cross-sample relational supervision that would let the model learn how predictions compare across a batch.
  • It proposes a distribution-aware reinforcement learning approach built on Group Relative Policy Optimization (GRPO) with a Concordance Correlation Coefficient (CCC)-based reward, so that predictions match the targets in correlation, scale, and mean (see the sketch after this list).
  • The method is plug-and-play, requiring no architectural changes, and it yields consistent gains on long-tailed regression benchmarks, especially in medium- and few-shot settings.
  • Overall, the work suggests that batch-level, comparison-based learning signals can substantially improve MLLM numerical regression for imbalanced data.
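As a rough illustration of the batch-level signal the key points describe, the sketch below computes a Concordance Correlation Coefficient over a batch of predictions and targets. This is a generic CCC implementation under our own assumptions (function name, epsilon handling), not the paper's exact reward.

```python
import numpy as np

def ccc_reward(preds: np.ndarray, targets: np.ndarray, eps: float = 1e-8) -> float:
    """Concordance Correlation Coefficient between a batch of predictions and
    ground-truth targets. CCC jointly penalizes mismatches in correlation,
    scale (variance), and mean, so a policy cannot score well simply by
    collapsing its predictions onto the mean of the target distribution."""
    preds = preds.astype(np.float64)
    targets = targets.astype(np.float64)
    mean_p, mean_t = preds.mean(), targets.mean()
    var_p, var_t = preds.var(), targets.var()
    cov = ((preds - mean_p) * (targets - mean_t)).mean()
    return float(2.0 * cov / (var_p + var_t + (mean_p - mean_t) ** 2 + eps))

# Hypothetical usage: a batch where predictions cluster near the mean scores
# a lower CCC than one that also tracks the spread of the targets.
targets = np.array([3.0, 18.0, 42.0, 77.0])
collapsed = np.full_like(targets, targets.mean())
tracking = np.array([5.0, 20.0, 40.0, 70.0])
print(ccc_reward(collapsed, targets), ccc_reward(tracking, targets))
```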

Abstract

Multimodal large language models (MLLMs) struggle with numerical regression under long-tailed target distributions. Token-level supervised fine-tuning (SFT) and point-wise regression rewards bias learning toward high-density regions, leading to regression-to-the-mean behavior and poor tail performance. We identify the lack of cross-sample relational supervision as a key limitation of existing MLLM training paradigms. To address it, we propose a distribution-aware reinforcement learning framework based on Group Relative Policy Optimization, which introduces batch-level, comparison-based supervision via a Concordance Correlation Coefficient (CCC)-based reward that aligns predicted and ground-truth distributions in terms of correlation, scale, and mean. The framework is plug-and-play, requiring no architectural modification. Experiments on a unified suite of long-tailed regression benchmarks show consistent improvements over SFT and existing MLLM regression methods, with particularly strong gains in medium- and few-shot regimes.
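For context on how such a reward plugs into Group Relative Policy Optimization, the following is a minimal sketch of the standard GRPO-style group-relative advantage normalization, assuming each sampled completion in a group has already been scored (e.g., with a CCC-based reward). The paper's actual objective, clipping, and any KL terms are not reproduced here, and the reward values shown are hypothetical.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """GRPO-style advantages: each sampled completion's reward is normalized by
    the mean and standard deviation of its group, so the learning signal is
    comparative (how good is this sample relative to its peers) rather than an
    absolute point-wise error."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical group of 4 sampled responses for one prompt, already scored.
group_rewards = np.array([0.62, 0.71, 0.55, 0.80])
advantages = group_relative_advantages(group_rewards)
print(advantages)  # positive for above-average samples, negative otherwise
```

Because the advantage is defined relative to the group rather than to a fixed target value, above-average completions are reinforced even in sparse tail regions, which is the comparison-based signal the abstract argues SFT lacks.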