Featurising Pixels from Dynamic 3D Scenes with Linear In-Context Learners

arXiv cs.CV / 4/30/2026


Key Points

  • The paper argues that current vision foundation models lack pixel-level representations that capture the spatio-temporal (joint spatial and temporal) properties of visual scenes.
  • It introduces LILA, a framework that learns pixel-accurate feature descriptors directly from videos to support dense pixel-level prediction at scale.
  • LILA’s key method is “linear in-context learning,” using spatio-temporal cue maps such as depth and motion estimated by off-the-shelf networks.
  • Even though the depth/motion cues can be noisy, the approach trains effectively on uncurated video datasets and produces temporally consistent embeddings containing semantic and geometric information.
  • The authors report strong empirical improvements on multiple downstream tasks, including video object segmentation, surface normal estimation, and semantic segmentation.
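The "linear in-context learning" idea described above can be pictured as fitting a lightweight linear head from pixel features to a noisy cue map (e.g. depth) on context frames, then applying that head to a query frame. The sketch below is an illustrative assumption, not the paper's actual training procedure; all names, shapes, and the least-squares formulation are hypothetical.

```python
import numpy as np

# Hypothetical sketch: regress per-pixel features from context frames
# onto a noisy cue map (e.g. off-the-shelf depth), then apply the
# fitted linear head in-context to a query frame. Shapes and data are
# synthetic stand-ins, not from the paper.

rng = np.random.default_rng(0)

D, H, W = 16, 8, 8             # feature dim, spatial size
n_ctx = 3                      # number of context frames

# Per-pixel features for context frames and one query frame.
ctx_feats = rng.normal(size=(n_ctx, H * W, D))
query_feats = rng.normal(size=(H * W, D))

# Noisy cue maps for the context frames (simulated as a linear
# function of the features plus noise, mimicking imperfect depth).
true_w = rng.normal(size=(D, 1))
ctx_cues = ctx_feats @ true_w + 0.1 * rng.normal(size=(n_ctx, H * W, 1))

# Fit a single least-squares linear head on the pooled context pixels.
X = ctx_feats.reshape(-1, D)
y = ctx_cues.reshape(-1, 1)
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# In-context prediction for the query frame is one matrix multiply.
query_pred = query_feats @ w
print(query_pred.shape)  # (64, 1): one predicted cue value per pixel
```

Because the head is linear and cheap to fit, gradients through such a probe could supervise the feature extractor at scale even when the cue maps are noisy, which matches the paper's claim of training effectively on uncurated video.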

Abstract

One of the most exciting applications of vision models involves pixel-level reasoning. Despite the abundance of vision foundation models, we still lack representations that effectively embed spatio-temporal properties of visual scenes at the pixel level. Existing frameworks either train on image-based pretext tasks, which do not account for dynamic elements, or on video sequences for action-level reasoning, which does not scale to dense pixel-level prediction. We present LILA, a framework that learns pixel-accurate feature descriptors from videos. The core element of our training framework is linear in-context learning. LILA leverages spatio-temporal cue maps -- depth and motion -- estimated with off-the-shelf networks. Despite the noisy nature of those cues, LILA trains effectively on uncurated video datasets, embedding semantic and geometric properties in a temporally consistent manner. We demonstrate compelling empirical benefits of the learned representation across a diverse suite of vision tasks: video object segmentation, surface normal estimation, and semantic segmentation.