Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs

arXiv cs.CV / 5/4/2026


Key Points

  • Autoregressive Large Vision-Language Models (LVLMs) can suffer from “Visual Signal Dilution,” where growing text history causes visual attention to decay as output sequences get longer.
  • The paper proposes Persistent Visual Memory (PVM), a lightweight learnable module that provides sustained, on-demand visual perception during generation.
  • PVM is integrated as a parallel branch alongside the LVLM's Feed-Forward Network (FFN), using a distance-agnostic retrieval path that injects visual embeddings directly for more stable perception.
  • Experiments on Qwen3-VL show consistent accuracy gains across both 4B and 8B model sizes, especially for complex reasoning tasks requiring persistent visual attention.
  • Additional analysis indicates PVM resists length-induced signal decay and can speed up internal prediction convergence.

Abstract

While autoregressive Large Vision-Language Models (LVLMs) demonstrate remarkable proficiency in multimodal tasks, they face a "Visual Signal Dilution" phenomenon, where the accumulation of textual history expands the attention partition function, causing visual attention to decay inversely with generated sequence length. To counteract this, we propose Persistent Visual Memory (PVM), a lightweight learnable module designed to ensure sustained, on-demand visual perception. Integrated as a parallel branch alongside the Feed-Forward Network (FFN) in LVLMs, PVM establishes a distance-agnostic retrieval pathway that directly provides visual embeddings for precise visual perception, thereby structurally mitigating the signal suppression inherent to deep generation. Extensive experiments on Qwen3-VL models demonstrate that PVM brings notable improvements with negligible parameter overhead, delivering consistent average accuracy gains across both 4B and 8B scales, particularly in complex reasoning tasks that demand persistent visual perception. Furthermore, in-depth analysis reveals that PVM can resist length-induced signal decay and accelerate internal prediction convergence.
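The paper does not spell out PVM's internals beyond the description above, but the core idea — a branch parallel to the FFN that retrieves visual embeddings by attention, with no positional/distance term so image tokens are weighted the same regardless of how long the generated sequence has grown — can be sketched as follows. All names (`PVMBranch`, `layer_with_pvm`), dimensions, and the single-head attention form are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class PVMBranch:
    """Hypothetical sketch of a Persistent Visual Memory branch.

    Runs in parallel with a decoder layer's FFN and retrieves visual
    embeddings by content-based attention. Because the scores contain
    no positional term, retrieval strength does not decay with the
    distance between a generated token and the image tokens
    ("distance-agnostic").
    """

    def __init__(self, d_model, rng=None):
        if rng is None:
            rng = np.random.default_rng(0)
        s = 1.0 / np.sqrt(d_model)
        self.Wq = rng.normal(0, s, (d_model, d_model))
        self.Wk = rng.normal(0, s, (d_model, d_model))
        self.Wv = rng.normal(0, s, (d_model, d_model))

    def __call__(self, h, visual):
        # h: (T, d) hidden states; visual: (N, d) cached visual embeddings
        q = h @ self.Wq                            # (T, d) queries from text states
        k = visual @ self.Wk                       # (N, d) keys from visual memory
        v = visual @ self.Wv                       # (N, d) values from visual memory
        scores = q @ k.T / np.sqrt(h.shape[-1])    # no positional bias: distance-agnostic
        return softmax(scores, axis=-1) @ v        # (T, d) retrieved visual signal

def layer_with_pvm(h, visual, ffn, pvm):
    """FFN and PVM run as parallel branches, summed into the residual stream."""
    return h + ffn(h) + pvm(h, visual)
```

Because the retrieved signal is added to the residual stream at every layer and every step, its magnitude does not shrink as the text history (and hence the self-attention partition function) grows — which is the structural fix the paper attributes to PVM.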