Modality-Aware Zero-Shot Pruning and Sparse Attention for Efficient Multimodal Edge Inference

arXiv cs.LG / 4/13/2026


Key Points

  • The paper proposes SentryFuse to enable efficient multimodal model compression for edge devices under fluctuating power budgets and unpredictable sensor dropout.
  • SentryGate learns modality-conditioned importance scores during training and then prunes attention heads and feed-forward channels at deployment without requiring post-compression fine-tuning.
  • SentryAttend replaces dense self-attention with sparse grouped-query attention to reduce compute bottlenecks in multimodal architectures.
  • Experiments across multiple multimodal applications and backbones show an average accuracy gain of 12.7% over pruning baselines (up to 18% under modality dropout), while cutting memory by 28.2% and latency by up to 1.63×, all without fine-tuning.
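The key points above describe pruning driven by modality-conditioned importance scores. A minimal sketch of how such zero-shot, modality-aware head pruning could work is shown below; the function names, shapes, and the way per-modality scores are combined are assumptions for illustration, not the paper's actual SentryGate implementation. First-order saliency estimates the loss change from removing a parameter group as the absolute first-order Taylor term, |Σ w · ∂L/∂w|.

```python
import numpy as np

def head_saliency(weights, grads):
    """First-order saliency per attention head: |sum(w * dL/dw)|.

    weights, grads: (num_heads, head_dim) arrays. Returns (num_heads,)
    non-negative importance scores.
    """
    return np.abs((weights * grads).sum(axis=1))

def prune_mask(per_modality_scores, active_modalities, keep_ratio):
    """Zero-shot pruning mask conditioned on which sensors are present.

    per_modality_scores: dict modality -> (num_heads,) saliency scores
        accumulated during training (hypothetical storage format).
    active_modalities: modalities actually observed at deployment.
    keep_ratio: fraction of heads to keep under the current power budget.
    """
    # Combine importance only over the modalities that are present,
    # so heads serving a dropped sensor do not inflate their scores.
    combined = sum(per_modality_scores[m] for m in active_modalities)
    num_heads = combined.shape[0]
    keep = max(1, int(round(keep_ratio * num_heads)))
    threshold = np.sort(combined)[::-1][keep - 1]
    return combined >= threshold  # True = head survives pruning

rng = np.random.default_rng(0)
scores = {m: np.abs(rng.normal(size=8)) for m in ("rgb", "depth", "audio")}
# Deployment scenario: the depth sensor has dropped out.
mask = prune_mask(scores, active_modalities=("rgb", "audio"), keep_ratio=0.5)
print(int(mask.sum()))  # 4 of 8 heads retained
```

Because the scores are precomputed during training, the deployment-time step is just a threshold over the active modalities' scores, which is what allows pruning without post-compression fine-tuning.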

Abstract

Edge devices increasingly run multimodal sensing pipelines that must remain accurate despite fluctuating power budgets and unpredictable sensor dropout. Existing pruning methods fail under these conditions: they generally require fine-tuning after compression, consuming over 10× the deployment energy, and they assign static importance scores that are blind to which sensors are present. We present the SentryFuse framework, which addresses both challenges jointly through two key components. First, SentryGate learns modality-conditioned importance scores during training via first-order saliency supervision and then prunes attention heads and feed-forward channels at deployment without fine-tuning. Second, SentryAttend replaces dense self-attention, a key bottleneck in contemporary multimodal architectures, with sparse grouped-query attention, yielding a net 15% reduction in GFLOPs across three different multimodal architectures. Across three applications and multimodal backbones, SentryGate achieves a 12.7% average accuracy improvement over the strongest pruning baseline, and up to 18% under modality dropout conditions. Together, SentryFuse reduces memory by 28.2% and lowers latency by up to 1.63× without further fine-tuning, establishing modality-aware zero-shot compression as a practical path to multimodal intelligence on heterogeneous edge hardware.
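The abstract's second component swaps dense self-attention for grouped-query attention, where several query heads share a single key/value head, shrinking the KV projections and cache. The sketch below illustrates the general grouped-query attention pattern (all shapes and the sparsity-free formulation are simplifying assumptions; the paper's SentryAttend adds sparsity on top of this, which is not reproduced here).

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """Grouped-query attention in plain NumPy.

    q: (num_q_heads, seq, d) query heads.
    k, v: (num_kv_heads, seq, d) shared key/value heads, with
        num_q_heads an integer multiple of num_kv_heads.
    Each group of num_q_heads // num_kv_heads query heads attends
    against one shared K/V head, cutting KV compute and memory.
    """
    num_q_heads, seq, d = q.shape
    num_kv_heads = k.shape[0]
    heads_per_group = num_q_heads // num_kv_heads
    out = np.empty_like(q)
    for h in range(num_q_heads):
        g = h // heads_per_group                      # shared K/V head index
        scores = q[h] @ k[g].T / np.sqrt(d)           # (seq, seq) logits
        scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
        probs = np.exp(scores)
        probs /= probs.sum(axis=-1, keepdims=True)
        out[h] = probs @ v[g]
    return out

rng = np.random.default_rng(1)
q = rng.normal(size=(8, 16, 32))   # 8 query heads
k = rng.normal(size=(2, 16, 32))   # only 2 shared K/V heads
v = rng.normal(size=(2, 16, 32))
y = grouped_query_attention(q, k, v)
print(y.shape)  # (8, 16, 32)
```

With 8 query heads sharing 2 K/V heads, the K/V projections and cache shrink 4×, which is the kind of attention-side saving the abstract's GFLOPs reduction draws on.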