Injecting Distributional Awareness into MLLMs via Reinforcement Learning for Deep Imbalanced Regression

arXiv cs.CL / 5/5/2026


Key Points

  • Multimodal large language models (MLLMs) often perform poorly on numerical regression with long-tailed (imbalanced) target distributions, tending to regress toward the mean because token-level supervised fine-tuning biases learning toward high-density regions of the target distribution.
  • The paper identifies a key gap in existing training: insufficient cross-sample relational supervision that would let the model learn how predictions compare across a batch.
  • It proposes a distribution-aware reinforcement learning approach built on Group Relative Policy Optimization (GRPO) with a Concordance Correlation Coefficient (CCC)-based reward, so that predictions match the targets in correlation, scale, and mean (see the sketch after this list).
  • The method is plug-and-play, requiring no architectural changes, and it yields consistent gains on long-tailed regression benchmarks, especially in medium- and few-shot settings.
  • Overall, the work suggests that batch-level, comparison-based learning signals can substantially improve MLLM numerical regression for imbalanced data.
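As a rough illustration of the batch-level signal the key points describe, the sketch below computes a Concordance Correlation Coefficient over a batch of predictions and targets. This is a generic CCC implementation under our own assumptions (function name, epsilon handling), not the paper's exact reward.

```python
import numpy as np

def ccc_reward(preds: np.ndarray, targets: np.ndarray, eps: float = 1e-8) -> float:
    """Concordance Correlation Coefficient between a batch of predictions and
    ground-truth targets. CCC jointly penalizes mismatches in correlation,
    scale (variance), and mean, so a policy cannot score well simply by
    collapsing its predictions onto the mean of the target distribution."""
    preds = preds.astype(np.float64)
    targets = targets.astype(np.float64)
    mean_p, mean_t = preds.mean(), targets.mean()
    var_p, var_t = preds.var(), targets.var()
    cov = ((preds - mean_p) * (targets - mean_t)).mean()
    return float(2.0 * cov / (var_p + var_t + (mean_p - mean_t) ** 2 + eps))

# Hypothetical usage: a batch where predictions cluster near the mean scores
# a lower CCC than one that also tracks the spread of the targets.
targets = np.array([3.0, 18.0, 42.0, 77.0])
collapsed = np.full_like(targets, targets.mean())
tracking = np.array([5.0, 20.0, 40.0, 70.0])
print(ccc_reward(collapsed, targets), ccc_reward(tracking, targets))
```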

Abstract

Multimodal large language models (MLLMs) struggle with numerical regression under long-tailed target distributions. Token-level supervised fine-tuning (SFT) and point-wise regression rewards bias learning toward high-density regions, leading to regression-to-the-mean behavior and poor tail performance. We identify the lack of cross-sample relational supervision as a key limitation of existing MLLM training paradigms. To address it, we propose a distribution-aware reinforcement learning framework based on Group Relative Policy Optimization, which introduces batch-level, comparison-based supervision via a Concordance Correlation Coefficient (CCC)-based reward that aligns predicted and ground-truth distributions in terms of correlation, scale, and mean. The framework is plug-and-play, requiring no architectural modification. Experiments on a unified suite of long-tailed regression benchmarks show consistent improvements over SFT and existing MLLM regression methods, with particularly strong gains in medium- and few-shot regimes.
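For context on how such a reward plugs into Group Relative Policy Optimization, the following is a minimal sketch of the standard GRPO-style group-relative advantage normalization, assuming each sampled completion in a group has already been scored (e.g., with a CCC-based reward). The paper's actual objective, clipping, and any KL terms are not reproduced here, and the reward values shown are hypothetical.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """GRPO-style advantages: each sampled completion's reward is normalized by
    the mean and standard deviation of its group, so the learning signal is
    comparative (how good is this sample relative to its peers) rather than an
    absolute point-wise error."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical group of 4 sampled responses for one prompt, already scored.
group_rewards = np.array([0.62, 0.71, 0.55, 0.80])
advantages = group_relative_advantages(group_rewards)
print(advantages)  # positive for above-average samples, negative otherwise
```

Because the advantage is defined relative to the group rather than to a fixed target value, above-average completions are reinforced even in sparse tail regions, which is the comparison-based signal the abstract argues SFT lacks.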