Abstract
Temporal modeling remains a fundamental challenge in video understanding, particularly as sequence lengths grow. Traditional video models that rely on dense spatiotemporal attention incur computational costs that scale quadratically with video length. To circumvent these costs, recent approaches adapt image models to video via Parameter-Efficient Fine-Tuning (PEFT) methods such as adapters. However, inserting these modules deep into the backbone incurs prohibitive activation memory overhead during back-propagation. Recent State Space Models (SSMs) achieve linear complexity, but they disrupt 2D spatial relationships and rely on extensive masked pre-training to recover spatial awareness.
To overcome these limitations, we propose Parallel Kinematic Selective State Space Scanners (PKS$^4$). We retain a standard 2D vision backbone for spatial semantics and insert a single plug-and-play PKS$^4$ module that performs linear-complexity temporal scanning, avoiding temporal attention and multi-layer adapters. We first extract kinematic priors via a Kinematic Prior Encoder, which captures local displacements and motion boundaries through inter-frame correlations and differences. These priors drive the SSMs to track the underlying kinematic states, adaptively modulating their update speeds and read-write strategies at each time step.
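To make the prior extraction concrete, below is a minimal sketch, assuming a PyTorch implementation. The class name, tensor layout, and the specific choice of channel-wise differences plus cosine similarity as proxies for displacements and motion boundaries are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class KinematicPriorEncoder(nn.Module):
    """Hypothetical prior encoder: inter-frame differences + correlations."""

    def __init__(self, dim: int):
        super().__init__()
        # Fuse the C-channel difference map with the 1-channel correlation map
        self.proj = nn.Conv2d(dim + 1, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C, H, W) per-frame features from the 2D backbone
        B, T, C, H, W = x.shape
        prev = torch.cat([x[:, :1], x[:, :-1]], dim=1)        # previous-frame features
        diff = x - prev                                        # local displacement proxy
        corr = F.cosine_similarity(x, prev, dim=2, eps=1e-6)   # (B, T, H, W), motion-boundary proxy
        prior = torch.cat([diff, corr.unsqueeze(2)], dim=2)    # (B, T, C+1, H, W)
        return self.proj(prior.flatten(0, 1)).view(B, T, C, H, W)
```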
Instead of scanning globally, we deploy parallel scanners along the temporal dimension at each spatial location, preserving spatial structure while reducing overhead. Experiments on spatial-heavy and temporal-heavy action recognition benchmarks show that PKS$^4$ achieves state-of-the-art performance. Remarkably, our method converges in only 20 epochs, requiring approximately 10$\times$ less training compute than pure video SSMs, establishing a new paradigm for efficient video understanding.
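The parallel, prior-modulated temporal scan can be sketched as follows, again assuming PyTorch; the diagonal state matrix, softplus step size, and module interface are our own illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ParallelTemporalScanner(nn.Module):
    """Hypothetical scanner: each spatial location is scanned independently
    along time with a diagonal, prior-conditioned (selective) state update."""

    def __init__(self, dim: int):
        super().__init__()
        self.A = nn.Parameter(-torch.rand(dim))   # negative diagonal -> stable decay
        self.delta_proj = nn.Linear(dim, dim)     # kinematic prior -> step size
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, prior: torch.Tensor) -> torch.Tensor:
        # x, prior: (B, T, C, H, W); fold space into the batch so every location
        # runs its own 1-D temporal scan and the 2D layout is never flattened away.
        B, T, C, H, W = x.shape
        u = x.permute(0, 3, 4, 1, 2).reshape(B * H * W, T, C)
        p = prior.permute(0, 3, 4, 1, 2).reshape(B * H * W, T, C)
        delta = F.softplus(self.delta_proj(p))     # prior-dependent update speed
        h = x.new_zeros(B * H * W, C)
        ys = []
        for t in range(T):                         # linear in T, no temporal attention
            a_bar = torch.exp(delta[:, t] * self.A)    # discretized per-channel decay
            h = a_bar * h + delta[:, t] * u[:, t]      # selective state update
            ys.append(self.out_proj(h))
        y = torch.stack(ys, dim=1)                 # (B*H*W, T, C)
        return y.view(B, H, W, T, C).permute(0, 3, 4, 1, 2)
```

Composing the two sketches, `ParallelTemporalScanner(dim)(feats, KinematicPriorEncoder(dim)(feats))` maps (B, T, C, H, W) features to the same shape, illustrating how a single such module could sit behind a frozen 2D backbone in the plug-and-play spirit described above.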