Focus-to-Perceive Representation Learning: A Cognition-Inspired Hierarchical Framework for Endoscopic Video Analysis
arXiv cs.CV / 3/30/2026
Key Points
- The paper introduces Focus-to-Perceive Representation Learning (FPRL) to improve endoscopic video analysis, targeting the clinical need for static, structured semantics when annotations are limited.
- FPRL is a cognition-inspired hierarchical framework that first learns intra-frame lesion-centric static semantics using teacher-prior adaptive masking (TPAM) and multi-view sparse sampling.
- It then learns contextual semantics across frames via cross-view masked feature completion (CVMFC) and attention-guided temporal prediction (AGTP), aiming to reduce motion bias while maintaining temporal continuity.
- Experiments on 11 endoscopic video datasets show FPRL delivers stronger results across a range of downstream tasks, and the authors provide code on GitHub.
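To make the first stage more concrete, here is a minimal Python sketch of two ideas named in the key points: sparse sampling of frames from a clip, and masking patches under guidance from a prior saliency map. This is not the authors' implementation. The function names `sparse_sample_indices` and `prior_guided_mask`, the segment-based sampling scheme, the weighted-sampling rule, and the choice to mask salient patches more often (rather than less often) are all assumptions made for illustration; the summary does not specify how TPAM or the sampling actually work.

```python
import random


def sparse_sample_indices(num_frames, num_samples, rng):
    """Pick one random frame index from each of `num_samples` equal segments.

    Hypothetical helper: one simple way to sample a long clip sparsely.
    """
    if not 0 < num_samples <= num_frames:
        raise ValueError("need 0 < num_samples <= num_frames")
    # Segment boundaries; every segment holds at least one frame.
    bounds = [i * num_frames // num_samples for i in range(num_samples + 1)]
    return [rng.randrange(bounds[i], bounds[i + 1]) for i in range(num_samples)]


def prior_guided_mask(saliency, mask_ratio, rng):
    """Return a boolean mask over patches (True means masked).

    Patches are drawn without replacement with probability proportional
    to their saliency score, using Efraimidis-Spirakis keys u ** (1 / w).
    Assumption: `saliency` stands in for a teacher model's prior, and
    higher-saliency patches should be masked more often.
    """
    num_patches = len(saliency)
    num_masked = round(mask_ratio * num_patches)
    keys = []
    for idx, weight in enumerate(saliency):
        weight = max(weight, 1e-8)  # guard against zero or negative scores
        keys.append((rng.random() ** (1.0 / weight), idx))
    keys.sort(reverse=True)  # the largest keys are the selected patches
    chosen = {idx for _, idx in keys[:num_masked]}
    return [i in chosen for i in range(num_patches)]


if __name__ == "__main__":
    rng = random.Random(0)
    frames = sparse_sample_indices(num_frames=32, num_samples=4, rng=rng)
    # Toy 4x4 patch grid: the last four patches play the role of a lesion.
    saliency = [0.1] * 12 + [5.0] * 4
    mask = prior_guided_mask(saliency, mask_ratio=0.25, rng=rng)
    print("sampled frames:", frames)
    print("masked patches:", [i for i, m in enumerate(mask) if m])
```

Passing a seeded `random.Random` instance keeps the sampling reproducible, which is convenient when comparing masking strategies on the same clip.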