From Perception to Planning: Evolving Ego-Centric Task-Oriented Spatiotemporal Reasoning via Curriculum Learning
arXiv cs.AI / 4/14/2026
Key Points
- The paper argues that current vision-language models struggle with embodied egocentric tasks because they rely on temporal priors learned from passive video, which can produce spatiotemporal hallucinations and weak generalization in dynamic settings.
- It introduces EgoTSR, a curriculum-based learning framework that stages reasoning from explicit spatial understanding to task-state assessment and ultimately long-horizon planning.
- To enable this training paradigm, the authors build EgoTSR-Data, a 46M-sample dataset arranged into three supervision stages: Chain-of-Thought (CoT), weakly supervised tagging, and long-horizon sequences.
- Experiments report that EgoTSR removes chronological biases and reaches 92.4% accuracy on long-horizon logical reasoning tasks while preserving high perceptual precision, outperforming prior state-of-the-art models.
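The staged curriculum described above can be sketched as a simple training-phase schedule. This is an illustrative Python sketch, not the paper's actual recipe: the stage names and the split fractions are hypothetical assumptions chosen to mirror the three supervision stages of EgoTSR-Data.

```python
# Hypothetical curriculum schedule mirroring EgoTSR's three stages:
# spatial CoT -> weakly supervised task-state tagging -> long-horizon planning.
# Stage names and fractional splits are illustrative assumptions.

STAGES = [
    ("spatial_cot", 0.4),         # Stage 1: Chain-of-Thought spatial supervision
    ("task_state_tagging", 0.3),  # Stage 2: weakly supervised task-state tags
    ("long_horizon", 0.3),        # Stage 3: long-horizon planning sequences
]

def stage_for_step(step: int, total_steps: int) -> str:
    """Return the name of the curriculum stage active at a given training step."""
    frac = step / total_steps
    cumulative = 0.0
    for name, share in STAGES:
        cumulative += share
        if frac < cumulative:
            return name
    return STAGES[-1][0]  # final stage covers any rounding remainder
```

For example, with 100 total steps this schedule spends steps 0-39 on spatial CoT, 40-69 on task-state tagging, and the rest on long-horizon planning; the point of such staging is that later stages build on representations learned in earlier ones.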
