SWAN: World-Aware Adaptive Multimodal Networks for Runtime Variations

arXiv cs.LG, April 30, 2026


Key Points

  • The paper introduces SWAN, a sample- and world-aware adaptive multimodal neural network designed to handle real-world runtime variations such as modality quality changes, input complexity shifts, and fluctuating compute resources.
  • SWAN combines a quality-aware controller (to allocate computation across modalities under a user-specified max budget), an adaptive gating module (to scale layer usage based on sample complexity), and a token-dropping module (to mask semantically irrelevant multimodal features) to improve compute efficiency.
  • The approach targets a key limitation of existing methods, which often fail to simultaneously respect strict compute budgets, account for input complexity, and adapt to multiple runtime factors.
  • Experiments on complex multi-object 3D detection in autonomous driving show up to a 49% reduction in FLOPs with minimal performance degradation.
  • The work positions SWAN as an early research advance toward more robust multimodal inference pipelines that maximize the value of compute spent under constraints.
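To make the three-module pipeline concrete, here is a minimal, hypothetical sketch of the control logic the key points describe: quality-proportional budget allocation across modalities, complexity-scaled layer usage, and relevance-based token dropping. All names, thresholds, and formulas below are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of SWAN-style runtime control logic.
# The proportional-allocation and rounding heuristics are assumptions
# for illustration; the paper's learned controllers are more involved.
from dataclasses import dataclass


@dataclass
class Modality:
    name: str
    quality: float  # estimated input quality in [0, 1]


def allocate_budget(modalities, max_flops):
    """Quality-aware controller: split a user-specified FLOP budget
    across modalities in proportion to their estimated quality."""
    total_quality = sum(m.quality for m in modalities)
    return {m.name: max_flops * m.quality / total_quality
            for m in modalities}


def select_depth(complexity, max_layers):
    """Adaptive gating: spend more layers on harder samples,
    always executing at least one layer."""
    return max(1, round(complexity * max_layers))


def drop_tokens(relevance_scores, keep_ratio):
    """Token dropping: keep only the indices of the most
    semantically relevant multimodal features."""
    k = max(1, int(len(relevance_scores) * keep_ratio))
    ranked = sorted(range(len(relevance_scores)),
                    key=lambda i: -relevance_scores[i])
    return ranked[:k]
```

For example, with a degraded LiDAR stream (quality 0.3) and a clean camera stream (quality 0.9), the controller above would route three quarters of the budget to the camera branch, while easy samples would run through a shallow stack and low-relevance tokens would be masked before detection.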

Abstract

Multimodal deep neural networks deployed in realistic environments must contend with runtime variations: changes in modality quality, overall input complexity, and available platform resources. Current networks struggle with such fluctuations -- adaptive networks cannot adhere to a strict compute budget, controller-based networks neglect to consider input complexity, and statically provisioned networks fail at all the above. Consequently, they do not extract maximum utility from the expended computational resources. We present SWAN (Sample and World-Aware Multimodal Network), the first adaptive multimodal network that accomplishes all three goals. SWAN employs a quality-aware controller to assign resources among modalities according to a variable user-specified maximum budget. Within this budget, an adaptive gating module further optimizes efficiency by scaling layer utilization according to sample complexity. For further gains, SWAN also employs a token dropping module that masks semantically irrelevant multimodal features before performing detections. We evaluate SWAN in the domain of autonomous driving with complex multi-object 3D detection, reducing FLOPs by up to 49% with minimal degradation.