AI Navigate

Unmasking Biases and Reliability Concerns in Convolutional Neural Networks Analysis of Cancer Pathology Images

arXiv cs.AI / 3/16/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The study analyzes 13 widely used cancer benchmark datasets using four CNN architectures across cancer types such as melanoma, carcinoma, colorectal cancer, and lung cancer to evaluate current practices.
  • It finds that CNNs can achieve high accuracy (up to about 93%) on datasets composed of cropped background segments without clinical content, challenging the validity of such benchmarks.
  • The results indicate that some architectures are more biased than others, suggesting that common ML evaluation methods may yield unreliable conclusions in cancer pathology.
  • The authors warn that these biases are hard to detect and may mislead researchers who rely on benchmark datasets, underscoring the need for more robust evaluation approaches.

Abstract

Convolutional Neural Networks (CNNs) have shown promising effectiveness in identifying different types of cancer from radiographs. However, the opaque nature of CNNs makes it difficult to fully understand how they operate, limiting their assessment to empirical evaluation. Here we study the soundness of the standard practices by which CNNs are evaluated for the purpose of cancer pathology. Thirteen highly used cancer benchmark datasets were analyzed using four common CNN architectures, covering different types of cancer such as melanoma, carcinoma, colorectal cancer, and lung cancer. We compared the accuracy of each model with its accuracy on datasets made of segments cropped from the background of the original images, which contain no clinically relevant content. Because the cropped datasets contain no clinical information, the null hypothesis is that the CNNs should achieve only chance-level accuracy when classifying them. The results show that the CNN models reached high accuracy on the cropped segments, sometimes as high as 93%, even though they lacked biomedical information, and that some CNN architectures are more sensitive to bias than others. The analysis shows that common machine-learning evaluation practices might lead to unreliable results when applied to cancer pathology. These biases are very difficult to identify and might mislead researchers who use available benchmark datasets to test the efficacy of CNN methods.
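The background-probe idea described above can be illustrated with a minimal synthetic sketch (not the paper's actual code or datasets): two classes of fake "slide" images whose backgrounds carry a subtle per-class offset, mimicking a scanner or staining confound. A simple nearest-centroid classifier stands in for a CNN; it still reaches well above chance accuracy on background-only crops, which is exactly the kind of result the study flags as a red flag.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_images(n, bg_offset):
    """Synthetic 32x32 'slides': random texture plus a per-class
    background offset (the hidden confound, e.g. scanner or staining)."""
    return rng.normal(loc=bg_offset, scale=1.0, size=(n, 32, 32))

def background_crops(images, size=8):
    """Crop a corner patch that contains no 'clinical' content."""
    return images[:, :size, :size].reshape(len(images), -1)

# Two classes whose *backgrounds* differ slightly (the dataset bias).
benign = make_images(200, bg_offset=0.0)
malignant = make_images(200, bg_offset=0.3)

X = np.vstack([background_crops(benign), background_crops(malignant)])
y = np.array([0] * 200 + [1] * 200)

# Train/test split, then classify with a nearest-centroid rule
# (a stand-in for a CNN; the point is the biased data, not the model).
train = np.r_[0:150, 200:350]
test = np.r_[150:200, 350:400]
centroids = [X[train][y[train] == c].mean(axis=0) for c in (0, 1)]
dists = np.stack([np.linalg.norm(X[test] - c, axis=1) for c in centroids])
acc = (dists.argmin(axis=0) == y[test]).mean()
print(f"background-only accuracy: {acc:.2f}")  # well above the 0.50 chance level
```

Under the null hypothesis of unbiased data, accuracy on such crops should hover near 0.50; anything consistently higher signals that the classifier is exploiting non-clinical cues, which is the failure mode the paper reports on real benchmarks.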