SpikeMLLM: Spike-based Multimodal Large Language Models via Modality-Specific Temporal Scales and Temporal Compression

arXiv cs.AI / 4/22/2026

Key Points

SpikeMLLM is a new spike-based framework for Multimodal Large Language Models (MLLMs) aimed at reducing inference compute and energy use in resource-constrained settings.
It addresses key spiking challenges for multimodality by using Modality-Specific Temporal Scales (MSTS) guided by Modality Evolution Discrepancy (MED), instead of relying on uniform spike encoding.
The method introduces Temporally Compressed LIF (TC-LIF) to compress timesteps from T=L-1 down to T=log2(L)-1, cutting the timestep-unfolding overhead that high-resolution image inputs amplify.
Experiments across four MLLMs and multiple multimodal benchmarks show near-lossless accuracy even under aggressive timestep compression (Tv/Tt=3/4), with average gaps of only 0.72% and 1.19% relative to the FP16 baseline on InternVL2-8B and Qwen2VL-72B.
A dedicated RTL accelerator designed for the spike datapath achieves 9.06x higher throughput and 25.8x better power efficiency than an FP16 GPU baseline in a co-design deployment setup.
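
To make the timestep compression in the key points concrete, the short Python sketch below evaluates the two stated formulas, T=L-1 and T=log2(L)-1, for a few values of L. The function name `timestep_counts` and the sample values of L are illustrative choices made here, not taken from the paper, and the sketch assumes L is a power of two so that log2(L) is an integer.

```python
def timestep_counts(levels: int) -> tuple[int, int]:
    """Return (uncompressed, compressed) timesteps for a given L.

    Uses only the two formulas quoted in the summary:
      uncompressed: T = L - 1
      compressed:   T = log2(L) - 1
    Assumption: L is a power of two and at least 4.
    """
    if levels < 4 or levels & (levels - 1) != 0:
        raise ValueError("levels must be a power of two and at least 4")
    log2_levels = levels.bit_length() - 1  # exact log2 for powers of two
    return levels - 1, log2_levels - 1


# Illustrative values of L (hypothetical, chosen for this example).
for levels in (4, 16, 32, 256):
    full, compressed = timestep_counts(levels)
    print(f"L={levels:>3}: T={full:>3} -> T={compressed}")
```

The gap between the two counts widens quickly: the uncompressed count grows linearly in L while the compressed count grows only logarithmically.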

Abstract

Multimodal Large Language Models (MLLMs) have achieved remarkable progress but incur substantial computational overhead and energy consumption during inference, limiting deployment in resource-constrained environments. Spiking Neural Networks (SNNs), with their sparse event-driven computation, offer inherent energy efficiency advantages on neuromorphic hardware, yet extending them to MLLMs faces two key challenges: heterogeneous modalities make uniform spike encoding insufficient, and high-resolution image inputs amplify timestep unfolding overhead. We propose SpikeMLLM, the first spike-based framework for MLLMs, which unifies existing ANN quantization methods in the spiking representation space and incorporates Modality-Specific Temporal Scales (MSTS) guided by Modality Evolution Discrepancy (MED) and Temporally Compressed LIF (TC-LIF) for timestep compression from T=L-1 to T=log2(L)-1. Experiments on four representative MLLMs across diverse multimodal benchmarks show that SpikeMLLM maintains near-lossless performance under aggressive timestep compression (Tv/Tt=3/4), with average gaps of only 0.72% and 1.19% relative to the FP16 baseline on InternVL2-8B and Qwen2VL-72B. We further develop a dedicated RTL accelerator tailored to the spike-driven datapath, observing 9.06x higher throughput and 25.8x better power efficiency relative to an FP16 GPU baseline under a deployment-oriented co-design setting, suggesting the promise of algorithm-hardware co-design for efficient multimodal intelligence.
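
The abstract says the framework expresses ANN quantization in the spiking representation space. As background only, the sketch below shows the textbook correspondence that this kind of statement usually rests on: a non-leaky integrate-and-fire neuron with soft reset, driven by a constant input for T timesteps, emits a spike count equal to a floor-quantized version of that input, so T timesteps give T+1 distinct output levels. This is a simplified stand-in, not the paper's LIF or TC-LIF formulation. The names `if_spike_count` and `floor_quantize` are hypothetical, and exact `Fraction` arithmetic is used here only to avoid floating-point edge cases.

```python
import math
from fractions import Fraction


def if_spike_count(x: Fraction, timesteps: int,
                   threshold: Fraction = Fraction(1)) -> int:
    """Simulate a non-leaky integrate-and-fire neuron with soft reset.

    The same input x (assumed to lie in [0, threshold]) is injected at
    every timestep. Whenever the membrane potential reaches the
    threshold, the neuron spikes and the threshold is subtracted.
    """
    potential = Fraction(0)
    spikes = 0
    for _ in range(timesteps):
        potential += x
        if potential >= threshold:
            potential -= threshold  # soft reset keeps the residual charge
            spikes += 1
    return spikes


def floor_quantize(x: Fraction, timesteps: int) -> int:
    """Floor-quantize x in [0, 1] onto the integer levels 0..timesteps."""
    return math.floor(x * timesteps)


# With T = 15 timesteps the spike count takes one of 16 levels (0..15).
T = 15
for k in (0, 10, 33, 50, 99, 100):
    x = Fraction(k, 100)
    print(f"x={float(x):.2f}  spikes={if_spike_count(x, T)}  "
          f"quantized={floor_quantize(x, T)}")
```

Under these assumptions the spike count and the quantized value agree for every input in [0, 1], which is why representing L levels by spike counts takes L-1 timesteps.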
