AI Powered Image Analysis for Phishing Detection

arXiv cs.CV / 4/16/2026

Key Points

Phishing attacks increasingly evade text- and URL-based detectors by visually imitating legitimate sites through copied logos, layouts, and color schemes, motivating screenshot-based detection.
The paper proposes a deep learning framework for visual phishing detection from webpage screenshots. It covers dataset creation and preprocessing, and evaluates ConvNeXt-Tiny and ViT-Base using transfer learning from ImageNet weights.
Results indicate ConvNeXt-Tiny achieves the best overall performance, with the highest F1-score at an optimized decision threshold and better efficiency than ViT-Base.
The study emphasizes threshold-aware evaluation (precision, recall, and F1-score across decision thresholds) to find operating points that balance detection rate against false alarms in real-world deployment.
As future work, the authors plan to release the curated dataset to support reproducibility and to enable comparison under consistent experimental setups.
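
The threshold-aware evaluation described in the points above can be sketched in a few lines. This is a minimal illustration, not the paper's code: the function names, the toy scores and labels, and the rule of picking the threshold with the highest F1-score are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def prf_at_threshold(
    scores: Sequence[float], labels: Sequence[int], threshold: float
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) when score >= threshold is flagged as phishing.

    labels use 1 for phishing and 0 for legitimate (an assumed convention).
    """
    tp = fp = fn = 0
    for score, label in zip(scores, labels):
        flagged = score >= threshold
        if flagged and label == 1:
            tp += 1
        elif flagged and label == 0:
            fp += 1
        elif not flagged and label == 1:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def sweep_thresholds(
    scores: Sequence[float], labels: Sequence[int], thresholds: Sequence[float]
) -> List[Tuple[float, float, float, float]]:
    """Return one (threshold, precision, recall, F1) row per candidate threshold."""
    return [(t, *prf_at_threshold(scores, labels, t)) for t in thresholds]


def best_f1_threshold(
    scores: Sequence[float], labels: Sequence[int], thresholds: Sequence[float]
) -> Tuple[float, float, float, float]:
    """Pick the row with the highest F1 (assumed selection rule, for illustration)."""
    return max(sweep_thresholds(scores, labels, thresholds), key=lambda row: row[3])


if __name__ == "__main__":
    # Toy model outputs and ground truth, invented for this example.
    toy_scores = [0.95, 0.80, 0.65, 0.40, 0.30, 0.10]
    toy_labels = [1, 1, 0, 1, 0, 0]
    for row in sweep_thresholds(toy_scores, toy_labels, [0.2, 0.35, 0.5, 0.7, 0.9]):
        print("threshold=%.2f precision=%.3f recall=%.3f f1=%.3f" % row)
```

On this toy data the default threshold of 0.5 is not the best operating point: lowering it to 0.35 recovers a missed phishing page and raises F1, which is the kind of effect a threshold sweep is meant to expose.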

Abstract

Phishing websites now rely heavily on visual imitation (copied logos, similar layouts, and matching colours) to avoid detection by text- and URL-based systems. This paper presents a deep learning approach that uses webpage screenshots for image-based phishing detection. Two vision models, ConvNeXt-Tiny and Vision Transformer (ViT-Base), were tested to see how well they handle visually deceptive phishing pages. The framework covers dataset creation, preprocessing, transfer learning with ImageNet weights, and evaluation at different decision thresholds. The results show that ConvNeXt-Tiny performs best overall, achieving the highest F1-score at the optimised threshold and running more efficiently than ViT-Base. This highlights the strength of convolutional models for visual phishing detection and shows why threshold tuning is important for real-world deployment. As future work, the curated dataset used in this study will be released to support reproducibility and encourage further research in this area. Unlike many existing studies that primarily report accuracy, this work places greater emphasis on threshold-aware evaluation to better reflect real-world deployment conditions. By examining precision, recall, and F1-score across different decision thresholds, the study identifies operating points that balance detection performance and false-alarm control. In addition, the side-by-side comparison of ConvNeXt-Tiny and ViT-Base under the same experimental setup offers practical insights into how convolutional and transformer-based architectures differ in robustness and computational efficiency for visual phishing detection.
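
For reference, the threshold-dependent metrics mentioned in the abstract can be written out as follows. The notation is an assumption introduced here, not taken from the paper: s_i is the model's phishing score for screenshot i, y_i is its true label (1 for phishing, 0 for legitimate), and τ is the decision threshold. The last line shows one plausible way to define an "optimised threshold" (maximising F1); the abstract does not state the exact selection rule.

```latex
% Assumed notation: s_i = phishing score, y_i \in \{0,1\} = true label, \tau = threshold.
\begin{align}
\hat{y}_i(\tau) &= \mathbb{1}\,[\,s_i \ge \tau\,] \\
\mathrm{TP}(\tau) &= \sum_i \mathbb{1}\,[\hat{y}_i(\tau)=1,\ y_i=1], \quad
\mathrm{FP}(\tau) = \sum_i \mathbb{1}\,[\hat{y}_i(\tau)=1,\ y_i=0], \quad
\mathrm{FN}(\tau) = \sum_i \mathbb{1}\,[\hat{y}_i(\tau)=0,\ y_i=1] \\
P(\tau) &= \frac{\mathrm{TP}(\tau)}{\mathrm{TP}(\tau)+\mathrm{FP}(\tau)}, \qquad
R(\tau) = \frac{\mathrm{TP}(\tau)}{\mathrm{TP}(\tau)+\mathrm{FN}(\tau)} \\
F_1(\tau) &= \frac{2\,P(\tau)\,R(\tau)}{P(\tau)+R(\tau)} \\
\tau^{\ast} &= \arg\max_{\tau}\ F_1(\tau) \qquad \text{(assumed selection rule)}
\end{align}
```

As τ rises, recall can only stay the same or fall, while precision usually (though not always) rises, so each threshold corresponds to a different trade-off between missed phishing pages and false alarms. A deployment with a strict false-alarm budget might instead choose the threshold with the highest recall subject to a minimum precision; that alternative is likewise an illustration, not something the abstract specifies.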