Unified Multimodal Uncertain Inference

arXiv cs.CV / April 13, 2026


Key Points

  • The paper proposes Unified Multimodal Uncertain Inference (UMUI), a task that requires models to output calibrated probability estimates for hypotheses using text, audio, video, or any combination of modalities.
  • It addresses a gap in prior work by moving beyond single-modality, binary entailment to enable fine-grained probabilistic reasoning across modalities.
  • The authors create a human-annotated evaluation dataset featuring scalar probability judgments across audio, visual, and audiovisual settings, and also test on existing text and audio benchmarks.
  • They introduce CLUE (Calibrated Latent Uncertainty Estimation), combining self-consistent teacher calibration with distribution-based confidence probing to improve calibration of predictions.
  • Results show their 3B-parameter model matches or outperforms baselines with up to 32B parameters across modalities.
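The paper does not specify which calibration metric it reports, but "calibrated probability estimates" are conventionally assessed with measures such as Expected Calibration Error (ECE): predictions are binned by confidence, and the gap between average predicted probability and empirical frequency is averaged across bins. A minimal sketch of that standard metric (function name and binning scheme are illustrative, not taken from the paper):

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE for scalar probability predictions of a hypothesis.

    probs:  predicted probability that the hypothesis holds, each in [0, 1]
    labels: 1 if the hypothesis actually holds, else 0
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Last bin is closed on the right so that probs == 1.0 are counted.
        mask = (probs >= lo) & ((probs < hi) if hi < 1.0 else (probs <= hi))
        if mask.any():
            conf = probs[mask].mean()   # average predicted probability in bin
            acc = labels[mask].mean()   # empirical frequency in bin
            ece += mask.mean() * abs(conf - acc)
    return ece
```

For example, a model that predicts 0.9 on ten items of which nine hold is perfectly calibrated (ECE = 0), while one that predicts 0.8 on ten items that all hold has ECE = 0.2.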

Abstract

We introduce Unified Multimodal Uncertain Inference (UMUI), a multimodal inference task spanning text, audio, and video, in which models must produce calibrated probability estimates of hypotheses conditioned on a premise in any modality or combination of modalities. While uncertain inference has been explored in text, extensions to other modalities have been limited to single-modality binary entailment judgments, leaving no framework for fine-grained probabilistic reasoning in or across other modalities. To address this, we curate a human-annotated evaluation set with scalar probability judgments across audio, visual, and audiovisual settings, and additionally evaluate on existing text and audio benchmarks. We introduce CLUE (Calibrated Latent Uncertainty Estimation), which combines self-consistent teacher calibration with distribution-based confidence probing to produce calibrated predictions. We demonstrate that our 3B-parameter model achieves performance equivalent to or stronger than baselines of up to 32B parameters across all modalities.