Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs
arXiv cs.CV / 5/4/2026
Key Points
- Autoregressive Large Vision-Language Models (LVLMs) can suffer from “Visual Signal Dilution”: as the output sequence grows, accumulating text history crowds out attention to the visual tokens, and perception of the image decays.
- The paper proposes Persistent Visual Memory (PVM), a lightweight learnable module that provides sustained, on-demand visual perception during generation.
- PVM is integrated as a parallel branch alongside the LVLM’s Feed-Forward Network (FFN), using a distance-agnostic retrieval path to inject visual embeddings directly for more stable perception (a rough sketch of this parallel-branch idea follows this list).
- Experiments on Qwen3-VL show consistent accuracy gains across both 4B and 8B model sizes, especially for complex reasoning tasks requiring persistent visual attention.
- Additional analysis indicates PVM resists length-induced signal decay and can speed up internal prediction convergence.
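The paper’s exact architecture is not reproduced here; the following is a minimal PyTorch sketch of the parallel-branch idea as the key points describe it: a cross-attention retrieval path with no positional bias (hence “distance-agnostic” with respect to generation length) runs alongside the FFN and injects cached visual embeddings into the residual stream. All class, module, and parameter names are hypothetical, not taken from the paper.

```python
import torch
import torch.nn as nn

class PersistentVisualMemory(nn.Module):
    """Sketch of a PVM-style branch: cross-attends from decoder hidden
    states to cached visual embeddings. No positional encoding is applied,
    so retrieval strength does not decay as the output sequence grows."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(d_model, d_model)

    def forward(self, hidden: torch.Tensor, visual_memory: torch.Tensor) -> torch.Tensor:
        # hidden: (B, T, D) decoder states; visual_memory: (B, V, D) cached
        # image-encoder embeddings, computed once and reused at every step.
        retrieved, _ = self.attn(hidden, visual_memory, visual_memory)
        # Gated injection keeps the added branch lightweight and stable.
        return torch.sigmoid(self.gate(hidden)) * retrieved

class DecoderBlockWithPVM(nn.Module):
    """Hypothetical decoder sub-block: the PVM branch runs in parallel
    with the FFN, and both outputs are summed into the residual stream."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.pvm = PersistentVisualMemory(d_model)

    def forward(self, hidden: torch.Tensor, visual_memory: torch.Tensor) -> torch.Tensor:
        x = self.norm(hidden)
        # Parallel branches: standard FFN plus on-demand visual retrieval.
        return hidden + self.ffn(x) + self.pvm(x, visual_memory)
```

In a setup like this, the visual memory is the image encoder’s output cached once per prompt, so the branch gives every generation step direct access to the original visual signal rather than relying on attention over an ever-longer mixed history.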