A Data-Centric Vision Transformer Baseline for SAR Sea Ice Classification

arXiv cs.CV / 4/6/2026


Key Points

  • The paper proposes a trustworthy SAR-only Vision Transformer baseline for sea-ice classification, explicitly avoiding claims of a fully validated multimodal system.
  • It trains ViT-Base and ViT-Large models on the AI4Arctic/ASIP Sea Ice Dataset (v2) using Sentinel-1 Extra Wide full-resolution inputs, leakage-aware stratified patch splitting, SIGRID-3 development labels, and training-set normalization.
  • Experiments compare cross-entropy and weighted cross-entropy for ViT-Base versus focal loss for ViT-Large to address severe class imbalance among morphologically similar ice types.
  • ViT-Large with focal loss achieves 69.6% held-out accuracy, 68.8% weighted F1, and strong minority-class performance for Multi-Year Ice (83.9% precision), showing improved precision–recall trade-offs versus weighted cross-entropy.
  • The authors position focal-loss ViT results as a cleaner reference point for future fusion work that combines SAR with optical, thermal, or meteorological data.
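The leakage-aware splitting mentioned above can be sketched as a scene-level, stratified split: patches are grouped by their parent Sentinel-1 scene so no scene contributes to both train and test, and scenes are bucketed by dominant ice class so class proportions are roughly preserved. This is a minimal illustrative sketch, not the authors' code; the function name, the `(patch_id, scene_id, label)` tuple format, and the dominant-class heuristic are assumptions.

```python
import random
from collections import defaultdict

def leakage_aware_split(patches, test_frac=0.2, seed=0):
    """Split patches so that no scene appears in both train and test,
    stratifying scenes by their dominant ice class.

    patches: iterable of (patch_id, scene_id, label) tuples.
    Returns (train_patches, test_patches).
    """
    # Group patch labels by parent scene.
    scenes = defaultdict(list)
    for _, scene_id, label in patches:
        scenes[scene_id].append(label)

    # Bucket scenes by their most frequent (dominant) ice class.
    by_class = defaultdict(list)
    for sid, labels in scenes.items():
        dominant = max(set(labels), key=labels.count)
        by_class[dominant].append(sid)

    # Within each class bucket, send ~test_frac of the scenes to test.
    rng = random.Random(seed)
    test_scenes = set()
    for sids in by_class.values():
        sids.sort()
        rng.shuffle(sids)
        k = max(1, round(test_frac * len(sids)))
        test_scenes.update(sids[:k])

    train = [p for p in patches if p[1] not in test_scenes]
    test = [p for p in patches if p[1] in test_scenes]
    return train, test
```

Splitting at the scene level is what prevents leakage: spatially overlapping or adjacent patches from one SAR scene are highly correlated, so a naive patch-level split would inflate held-out accuracy.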

Abstract

Accurate and automated sea ice classification is important for climate monitoring and maritime safety in the Arctic. While Synthetic Aperture Radar (SAR) is the operational standard because of its all-weather capability, distinguishing morphologically similar ice classes under severe class imbalance remains challenging. Rather than claiming a fully validated multimodal system, this paper establishes a trustworthy SAR-only baseline that future fusion work can build upon. Using the AI4Arctic/ASIP Sea Ice Dataset (v2), which contains 461 Sentinel-1 scenes matched with expert ice charts, we combine full-resolution Sentinel-1 Extra Wide inputs, leakage-aware stratified patch splitting, SIGRID-3 stage-of-development labels, and training-set normalization to evaluate Vision Transformer baselines. We compare ViT-Base models trained with cross-entropy and weighted cross-entropy against a ViT-Large model trained with focal loss. Among the tested configurations, ViT-Large with focal loss achieves 69.6% held-out accuracy, 68.8% weighted F1, and 83.9% precision on the minority Multi-Year Ice class. These results show that focal-loss training offers a more useful precision–recall trade-off than weighted cross-entropy for rare ice classes and establish a cleaner baseline for future multimodal fusion with optical, thermal, or meteorological data.
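The focal-loss versus cross-entropy comparison at the heart of the paper can be illustrated with the standard focal-loss formula, FL(p_t) = -α(1 - p_t)^γ log(p_t), where p_t is the model's probability for the true class. This is a generic sketch of the loss (Lin et al.'s formulation), not the authors' training code; γ = 2 is a common default, and the specific probabilities below are made-up examples.

```python
import math

def focal_loss(p_true, gamma=2.0, alpha=1.0):
    """Focal loss for one sample, given the model's predicted
    probability for the true class. With gamma=0 and alpha=1 this
    reduces to plain cross-entropy, -log(p_true)."""
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)

# An easy, confidently classified sample (p=0.9) is strongly
# down-weighted, while a hard sample (p=0.1) keeps most of its loss.
easy_ce, easy_fl = focal_loss(0.9, gamma=0.0), focal_loss(0.9, gamma=2.0)
hard_ce, hard_fl = focal_loss(0.1, gamma=0.0), focal_loss(0.1, gamma=2.0)
print(easy_fl / easy_ce)  # 0.01: easy samples contribute almost nothing
print(hard_fl / hard_ce)  # 0.81: hard samples dominate the gradient
```

This down-weighting of easy majority-class examples is why focal loss can lift minority-class precision (e.g. the 83.9% reported for Multi-Year Ice) without the blunt per-class reweighting of weighted cross-entropy.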