StreamCacheVGGT: Streaming Visual Geometry Transformers with Robust Scoring and Hybrid Cache Compression

arXiv cs.CV / 4/17/2026

Key Points

  • StreamCacheVGGT is a training-free framework for reconstructing dense 3D geometry from streaming video under a strict constant memory budget.
  • The method improves token caching by replacing fragile, single-layer scoring with Cross-Layer Consistency-Enhanced Scoring (CLCES), which tracks token importance trajectories across the Transformer hierarchy to reduce activation noise.
  • Instead of pure eviction, StreamCacheVGGT uses Hybrid Cache Compression (HCC), a three-tier triage strategy that merges moderately important tokens into retained anchors via nearest-neighbor assignment on the key-vector manifold.
  • The authors report new state-of-the-art results on five benchmarks (7-Scenes, NRGBD, ETH3D, Bonn, and KITTI), with better reconstruction accuracy and long-term stability at constant cost.
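
To make the cross-layer scoring idea concrete, here is a minimal sketch, not the paper's implementation. It assumes each Transformer layer yields one raw importance score per cached token, converts each layer's scores to ranks, and takes the median rank across layers as the order statistic. The summary does not say which order statistic CLCES uses, so the median and the function names `layer_ranks` and `cross_layer_score` are illustrative assumptions.

```python
from statistics import median


def layer_ranks(scores):
    """Map one layer's raw scores to ranks in [0, 1] (1 = most important)."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    for pos, idx in enumerate(order):
        ranks[idx] = pos / (n - 1) if n > 1 else 1.0
    return ranks


def cross_layer_score(per_layer_scores):
    """Median per-layer rank of each token (assumed order statistic)."""
    ranked = [layer_ranks(layer) for layer in per_layer_scores]
    n_tokens = len(per_layer_scores[0])
    return [median(layer[t] for layer in ranked) for t in range(n_tokens)]


# Hypothetical scores: 3 layers x 4 tokens.
# Token 1 spikes only in layer 0; token 2 is consistently important.
per_layer = [
    [0.1, 0.9, 0.5, 0.2],
    [0.2, 0.1, 0.6, 0.3],
    [0.1, 0.2, 0.7, 0.3],
]
scores = cross_layer_score(per_layer)
best = max(range(len(scores)), key=lambda t: scores[t])
```

Scoring from layer 0 alone would rank token 1 highest. In this sketch, aggregating across layers demotes that one-layer spike and favors token 2, whose importance is sustained.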

Abstract

Reconstructing dense 3D geometry from continuous video streams requires stable inference under a constant memory budget. Existing O(1) frameworks primarily rely on a "pure eviction" paradigm, which suffers from significant information destruction due to binary token deletion and evaluation noise from localized, single-layer scoring. To address these bottlenecks, we propose StreamCacheVGGT, a training-free framework that reimagines cache management through two synergistic modules: Cross-Layer Consistency-Enhanced Scoring (CLCES) and Hybrid Cache Compression (HCC). CLCES mitigates activation noise by tracking token importance trajectories across the Transformer hierarchy, employing order-statistical analysis to identify sustained geometric salience. Leveraging these robust scores, HCC transcends simple eviction by introducing a three-tier triage strategy that merges moderately important tokens into retained anchors via nearest-neighbor assignment on the key-vector manifold. This approach preserves essential geometric context that would otherwise be lost. Extensive evaluations on five benchmarks (7-Scenes, NRGBD, ETH3D, Bonn, and KITTI) demonstrate that StreamCacheVGGT sets a new state-of-the-art, delivering superior reconstruction accuracy and long-term stability while strictly adhering to constant-cost constraints.
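
The three-tier triage described in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code: the tier sizes `n_keep` and `n_merge`, the squared Euclidean distance between key vectors, and merging by plain averaging are all hypothetical choices, since the abstract specifies only that mid-tier tokens are merged into retained anchors by nearest-neighbor assignment on keys.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two key vectors (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def hybrid_compress(keys, values, scores, n_keep, n_merge):
    """Keep the top n_keep tokens as anchors, merge the next n_merge
    into their nearest anchor, and evict the rest."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    anchors = order[:n_keep]
    to_merge = order[n_keep:n_keep + n_merge]  # tokens after this are evicted

    # Running sums per anchor so merged entries become simple averages.
    key_sum = {a: list(keys[a]) for a in anchors}
    val_sum = {a: list(values[a]) for a in anchors}
    count = {a: 1 for a in anchors}

    for t in to_merge:
        # Nearest anchor is measured against the anchors' original keys.
        nearest = min(anchors, key=lambda a: sq_dist(keys[t], keys[a]))
        key_sum[nearest] = [s + x for s, x in zip(key_sum[nearest], keys[t])]
        val_sum[nearest] = [s + x for s, x in zip(val_sum[nearest], values[t])]
        count[nearest] += 1

    new_keys = [[s / count[a] for s in key_sum[a]] for a in anchors]
    new_values = [[s / count[a] for s in val_sum[a]] for a in anchors]
    return new_keys, new_values


# Hypothetical cache of 5 tokens with 2-D keys and 1-D values.
keys = [[0, 0], [10, 0], [1, 0], [9, 1], [5, 5]]
values = [[1.0], [2.0], [3.0], [4.0], [100.0]]
scores = [0.9, 0.8, 0.5, 0.4, 0.1]
new_keys, new_values = hybrid_compress(keys, values, scores, n_keep=2, n_merge=2)
```

The compressed cache always holds exactly `n_keep` entries, which is what keeps memory constant in this sketch. Tokens 2 and 3 contribute to their nearest anchors instead of being discarded, and only the lowest-scoring token 4 is evicted outright.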