From Perception to Planning: Evolving Ego-Centric Task-Oriented Spatiotemporal Reasoning via Curriculum Learning

arXiv cs.AI / April 14, 2026


Key Points

  • The paper argues that current vision-language models struggle with embodied egocentric tasks because they rely on temporal priors learned from passive video, which can cause spatiotemporal hallucinations and weak generalization in dynamic settings.
  • It introduces EgoTSR, a curriculum-based learning framework that stages reasoning from explicit spatial understanding to task-state assessment and ultimately long-horizon planning.
  • To enable this training paradigm, the authors build EgoTSR-Data, a 46M-sample dataset arranged into three supervision stages: Chain-of-Thought (CoT), weakly supervised tagging, and long-horizon sequences.
  • Experiments report that EgoTSR removes chronological biases and reaches 92.4% accuracy on long-horizon logical reasoning tasks while preserving high perceptual precision, outperforming prior state-of-the-art models.

Abstract

Modern vision-language models achieve strong performance in static perception, but remain limited in the complex spatiotemporal reasoning required for embodied, egocentric tasks. A major source of failure is their reliance on temporal priors learned from passive video data, which often leads to spatiotemporal hallucinations and poor generalization in dynamic environments. To address this, we present EgoTSR, a curriculum-based framework for learning task-oriented spatiotemporal reasoning. EgoTSR is built on the premise that embodied reasoning should evolve from explicit spatial understanding to internalized task-state assessment and finally to long-horizon planning. To support this paradigm, we construct EgoTSR-Data, a large-scale dataset comprising 46 million samples organized into three stages: Chain-of-Thought (CoT) supervision, weakly supervised tagging, and long-horizon sequences. Extensive experiments demonstrate that EgoTSR effectively eliminates chronological biases, achieving 92.4% accuracy on long-horizon logical reasoning tasks while maintaining high fine-grained perceptual precision, significantly outperforming existing open-source and closed-source state-of-the-art models.
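The staged curriculum described above (CoT supervision, then weakly supervised tagging, then long-horizon sequences) can be sketched as a simple stage-ordered data schedule. This is a minimal illustrative sketch, not the authors' implementation: the stage names, descriptions, and sample counts below are assumptions made up for clarity.

```python
# Hypothetical sketch of a three-stage curriculum schedule in the spirit of
# EgoTSR-Data: earlier stages are exhausted before later ones begin.
# All names and proportions here are illustrative, not from the paper.

STAGES = [
    ("cot_supervision", "Chain-of-Thought rationales for explicit spatial understanding"),
    ("weak_tagging", "weakly supervised tags for task-state assessment"),
    ("long_horizon", "long-horizon sequences for planning"),
]

def curriculum(samples_per_stage):
    """Yield (stage_name, sample_index) pairs in strict stage order,
    so training on a later stage starts only after the earlier stage ends."""
    for (name, _desc), n in zip(STAGES, samples_per_stage):
        for i in range(n):
            yield (name, i)

# A tiny schedule: 2 CoT samples, 2 tagging samples, 1 long-horizon sample.
schedule = list(curriculum([2, 2, 1]))
```

The key property such a schedule enforces is monotonic staging: the model never sees long-horizon planning data before the perception-oriented stages are complete, mirroring the paper's premise that reasoning should evolve from spatial understanding to task-state assessment to planning.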