ABMAMBA: Multimodal Large Language Model with Aligned Hierarchical Bidirectional Scan for Efficient Video Captioning

arXiv cs.CV / 4/10/2026


Key Points

  • The paper introduces ABMamba, a fully open multimodal large language model designed specifically for efficient video captioning with long temporal sequences.
  • It tackles the quadratic compute bottleneck of Transformer attention by replacing the attention-based language backbone with Deep State Space Models, a linear-complexity alternative.
  • ABMamba’s key innovation is an “Aligned Hierarchical Bidirectional Scan” module that processes video information at multiple temporal resolutions to better capture temporal dependencies.
  • On benchmarks like VATEX and MSR-VTT, the model achieves competitive captioning quality relative to typical MLLMs while improving throughput by about 3x.
  • Overall, the work targets scalability for video understanding workloads by reducing compute costs without heavily sacrificing benchmark performance.
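To make the multi-resolution idea above concrete, here is a minimal toy sketch of a hierarchical bidirectional scan over per-frame features. The decay recurrence stands in for a selective SSM (Mamba) block, and the downsample/upsample fusion is an assumed simplification; the paper's actual module, parameterization, and alignment mechanism are not detailed in this summary.

```python
import numpy as np

def bidirectional_scan(x, decay=0.9):
    """Toy linear-time recurrent scan over a (T, D) feature sequence.

    h_t = decay * h_{t-1} + x_t, run forward and backward, then summed.
    A stand-in for a bidirectional SSM block, not the paper's layer.
    """
    T, D = x.shape
    fwd = np.zeros_like(x)
    bwd = np.zeros_like(x)
    h = np.zeros(D)
    for t in range(T):          # forward pass: one update per frame
        h = decay * h + x[t]
        fwd[t] = h
    h = np.zeros(D)
    for t in reversed(range(T)):  # backward pass over the same frames
        h = decay * h + x[t]
        bwd[t] = h
    return fwd + bwd

def hierarchical_bidirectional_scan(frames, levels=3):
    """Scan at several temporal resolutions, then fuse by upsampling
    (nearest-neighbor repeat) each level back to full length."""
    T, _ = frames.shape
    out = np.zeros_like(frames)
    for lvl in range(levels):
        stride = 2 ** lvl
        coarse = frames[::stride]                     # downsample in time
        scanned = bidirectional_scan(coarse)
        up = np.repeat(scanned, stride, axis=0)[:T]   # upsample back
        out += up / levels
    return out
```

Each level halves the temporal resolution, so coarser levels see longer-range context per step while finer levels preserve local detail; the total cost stays linear in the number of frames.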

Abstract

In this study, we focus on video captioning by fully open multimodal large language models (MLLMs). The comprehension of visual sequences is challenging because of their intricate temporal dependencies and substantial sequence length. The core attention mechanisms of existing Transformer-based approaches scale quadratically with the sequence length, making them computationally prohibitive. To address these limitations, we propose Aligned Hierarchical Bidirectional Scan Mamba (ABMamba), a fully open MLLM with linear computational complexity that enables the scalable processing of video sequences. ABMamba extends Deep State Space Models as its language backbone, replacing the costly quadratic attention mechanisms, and employs a novel Aligned Hierarchical Bidirectional Scan module that processes videos across multiple temporal resolutions. On standard video captioning benchmarks such as VATEX and MSR-VTT, ABMamba demonstrates competitive performance compared to typical MLLMs while achieving approximately three times higher throughput.
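The complexity claim in the abstract can be illustrated with a back-of-the-envelope operation count (an illustrative sketch, not a measurement from the paper): self-attention computes a similarity between every pair of the T sequence positions, while a recurrent scan touches each position once.

```python
def attention_ops(T, d):
    # every token attends to every other token: T x T similarities,
    # each over a d-dimensional feature -> quadratic in T
    return T * T * d

def scan_ops(T, d):
    # one d-dimensional recurrent state update per timestep -> linear in T
    return T * d

# doubling the video length quadruples attention cost but only
# doubles the scan cost
for T in (1_000, 2_000, 4_000):
    print(T, attention_ops(T, 64), scan_ops(T, 64))
```

This widening gap as videos get longer is why a linear-complexity backbone can trade a small amount of benchmark accuracy for the roughly 3x throughput gain the paper reports.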