Geometric Context Transformer for Streaming 3D Reconstruction

arXiv cs.CV / 4/16/2026



Key Points

The paper introduces LingBot-Map, a feed-forward 3D foundation model for streaming 3D reconstruction that uses a geometric context transformer (GCT) architecture inspired by SLAM principles.
Its attention mechanism combines anchor context, a pose-reference window, and trajectory memory to improve coordinate grounding, leverage dense geometric cues, and correct long-range drift.
The method is designed to keep the streaming state compact while maintaining rich geometric information for stable, efficient inference.
The reported inference speed is around 20 FPS at 518×378 input resolution, and the model handles long sequences exceeding 10,000 frames.
Experiments on multiple benchmarks show the approach outperforms prior streaming methods and iterative optimization-based approaches.
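
To make the three context types in the key points concrete, here is a minimal sketch of a bounded streaming state. It is not the paper's implementation. The class name `StreamingContext`, the sizes `num_anchor` and `window_size`, and the choice to keep one compact summary entry per evicted frame are all assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class StreamingContext:
    """Hypothetical streaming state; names and sizes are illustrative assumptions."""
    num_anchor: int = 2    # assumed: the first frames fix the coordinate frame
    window_size: int = 8   # assumed: number of recent frames kept with dense tokens
    anchors: list = field(default_factory=list)   # anchor context
    window: deque = field(default_factory=deque)  # pose-reference window
    memory: list = field(default_factory=list)    # trajectory memory (compact)

    def push(self, frame_id, dense_tokens, summary_token):
        """Add one frame; demote the oldest window frame to compact memory."""
        entry = (frame_id, dense_tokens, summary_token)
        if len(self.anchors) < self.num_anchor:
            self.anchors.append(entry)
            return
        self.window.append(entry)
        if len(self.window) > self.window_size:
            old_id, _dense, old_summary = self.window.popleft()
            # Assumption: only one summary entry survives per evicted frame.
            self.memory.append((old_id, old_summary))

    def context_size(self, tokens_per_frame):
        """Tokens a new frame would attend to under this sketch."""
        dense = (len(self.anchors) + len(self.window)) * tokens_per_frame
        return dense + len(self.memory)


ctx = StreamingContext(num_anchor=2, window_size=3)
for i in range(10):
    ctx.push(i, dense_tokens=None, summary_token=f"s{i}")
print(ctx.context_size(tokens_per_frame=100))  # 5 dense frames + 5 summaries = 505
```

Under these assumptions the dense part of the state stays constant and only the compact memory grows, by one entry per frame, which is one way a streaming state can stay small over long sequences.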

Abstract

Streaming 3D reconstruction aims to recover 3D information, such as camera poses and point clouds, from a video stream, which necessitates geometric accuracy, temporal consistency, and computational efficiency. Motivated by the principles of Simultaneous Localization and Mapping (SLAM), we introduce LingBot-Map, a feed-forward 3D foundation model for reconstructing scenes from streaming data, built upon a geometric context transformer (GCT) architecture. A defining aspect of LingBot-Map lies in its carefully designed attention mechanism, which integrates an anchor context, a pose-reference window, and a trajectory memory to address coordinate grounding, dense geometric cues, and long-range drift correction, respectively. This design keeps the streaming state compact while retaining rich geometric context, enabling stable, efficient inference at around 20 FPS on 518×378 resolution inputs over long sequences exceeding 10,000 frames. Extensive evaluations across a variety of benchmarks demonstrate that our approach achieves superior performance compared to both existing streaming and iterative optimization-based approaches.
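
For a sense of scale, the figures quoted in the abstract can be turned into rough arithmetic. The resolution, frame rate, and sequence length come from the abstract. The patch size of 14 is an assumption (a common ViT choice that divides 518 and 378 evenly) and is not stated in the source.

```python
PATCH = 14                 # assumed ViT patch size; not given in the abstract
WIDTH, HEIGHT = 518, 378   # input resolution from the abstract
FPS = 20                   # approximate inference speed from the abstract
NUM_FRAMES = 10_000        # sequence length from the abstract

# Patch tokens per frame under the assumed patch size.
tokens_per_frame = (WIDTH // PATCH) * (HEIGHT // PATCH)

# Tokens that would accumulate if every past frame were kept densely.
full_history_tokens = tokens_per_frame * NUM_FRAMES

# Wall-clock duration of such a sequence at the quoted speed.
stream_seconds = NUM_FRAMES / FPS

print(tokens_per_frame)     # 37 * 27 = 999
print(full_history_tokens)  # 9990000
print(stream_seconds)       # 500.0
```

Under this assumption, keeping every frame densely would mean attending over roughly ten million tokens by the end of a 10,000-frame sequence, which illustrates why a compact streaming state matters for sustained real-time inference.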

