LIMSSR: LLM-Driven Sequence-to-Score Reasoning under Training-Time Incomplete Multimodal Observations

arXiv cs.CV / 5/4/2026


Key Points

  • The paper studies incomplete multimodal learning (IML) in a more realistic training setting where some modalities are missing at training time, removing the usual assumption of full-modal reconstruction supervision.
  • It introduces LIMSSR, which reframes incomplete multimodal prediction as a conditional sequence-to-score reasoning problem and uses prompt-guided, context-aware modality imputation with an LLM to infer latent semantics.
  • Instead of direct reconstruction, LIMSSR fuses multidimensional representations to learn from only the available modalities and related context.
  • To reduce hallucinations, the method uses a Mask-Aware Dual-Path Aggregation mechanism that dynamically calibrates inference uncertainty.
  • Experiments on three Action Quality Assessment datasets show that LIMSSR significantly outperforms existing baselines without requiring complete training data, and the authors have released their code.
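The dual-path idea in the last-but-one point can be illustrated with a toy sketch. This is not the paper's implementation: the function name, the fixed 0.5 confidence gate, and the feature shapes are all hypothetical, chosen only to show how a modality mask can route observed features through a trusted path while down-weighting imputed ones.

```python
import numpy as np

def dual_path_aggregate(observed_feat, imputed_feat, mask):
    """Toy sketch of mask-aware dual-path aggregation (illustrative only).

    observed_feat, imputed_feat: (M, D) per-modality features, where
    imputed_feat holds LLM-inferred stand-ins for missing modalities.
    mask: (M,) with 1.0 if the modality was actually observed, else 0.0.

    Observed modalities pass through unchanged; imputed ones are shrunk
    by a confidence gate so hallucinated content cannot dominate fusion.
    """
    mask = mask[:, None]                       # (M, 1), broadcast over D
    # Hypothetical gate: observed features get weight 1, imputed get 0.5.
    gate = mask + (1.0 - mask) * 0.5
    fused = gate * (mask * observed_feat + (1.0 - mask) * imputed_feat)
    # Normalize by total gate mass so the output scale is mask-independent.
    return fused.sum(axis=0) / gate.sum()
```

With all modalities observed, the function reduces to a plain mean over modalities; as modalities go missing, their imputed replacements contribute, but only at reduced weight.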

Abstract

Real-world multimodal learning is often hindered by missing modalities. While Incomplete Multimodal Learning (IML) has gained traction, existing methods typically rely on the unrealistic assumption of full-modal availability during training to provide reconstruction supervision or cross-modal priors. This paper tackles the more challenging setting of IML under training-time incomplete observations, which precludes reliance on a "God's eye view" of complete data. We propose LIMSSR (LLM-Driven Incomplete Multimodal Sequence-to-Score Reasoning), a framework that reformulates this challenge as a conditional sequence reasoning task. LIMSSR leverages the semantic reasoning capabilities of Large Language Models via Prompt-Guided Context-Aware Modality Imputation and Multidimensional Representation Fusion to infer latent semantics from available contexts without direct reconstruction. To mitigate hallucinations, we introduce a Mask-Aware Dual-Path Aggregation to dynamically calibrate inference uncertainty. Extensive experiments on three Action Quality Assessment datasets demonstrate that LIMSSR significantly outperforms state-of-the-art baselines without relying on complete training data, establishing a new paradigm for data-efficient multimodal learning. Code is available at https://github.com/XuHuangbiao/LIMSSR.
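The "sequence-to-score" framing in the abstract can be sketched as a tiny pooling head: a variable-length sequence of fused multimodal tokens is reduced to a single scalar quality score. Everything here is a hypothetical stand-in (attention pooling, the projection vectors, the names), not the architecture from the paper.

```python
import numpy as np

def sequence_to_score(tokens, w_attn, w_score):
    """Toy sequence-to-score head (illustrative, not the paper's model).

    tokens:  (T, D) sequence of fused multimodal token embeddings.
    w_attn:  (D,) projection producing per-token relevance logits.
    w_score: (D,) projection mapping the pooled summary to a score.

    The sequence is reduced by softmax attention pooling, then mapped
    to one scalar, mirroring the idea of casting quality prediction as
    conditional reasoning over a token sequence.
    """
    logits = tokens @ w_attn                   # (T,) per-token relevance
    weights = np.exp(logits - logits.max())    # numerically stable softmax
    weights /= weights.sum()
    pooled = weights @ tokens                  # (D,) attention-weighted summary
    return float(pooled @ w_score)             # scalar quality score
```

Because the pooling is length-agnostic, the same head accepts sequences of different lengths, which is what lets missing modalities simply shorten (or substitute tokens in) the input rather than break the model.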