Cellwise Outliers

arXiv stat.ML / 4/17/2026



Key Points

The paper highlights a shift from traditional “casewise” outliers/anomalies to “cellwise outliers,” where individual entries in a data matrix or tensor deviate from the norm.
It explains that even a small fraction of outlying cells can corrupt a majority of cases in higher-dimensional settings, undermining existing casewise detection methods.
The authors argue that detecting cellwise outliers and building cellwise-robust estimators requires fundamentally different techniques than the casewise paradigm, sometimes sacrificing intuitive equivariance properties.
The review surveys recent progress in cellwise-robust estimation of location and covariance, regression, PCA, and tensor-data methods, noting that cellwise approaches are increasingly dominant for high-dimensional data and can handle missing values.
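
The second point can be made concrete under a simplifying assumption that is not stated in the summary itself: if each cell is contaminated independently with probability eps, a case with d cells is entirely clean with probability (1 - eps)^d, so the expected fraction of cases containing at least one outlying cell is 1 - (1 - eps)^d. A minimal Python sketch of that arithmetic follows; the function name is hypothetical and chosen only for illustration.

```python
def contaminated_case_fraction(eps: float, d: int) -> float:
    """Expected fraction of cases with at least one outlying cell.

    Assumption (ours, for illustration): each of the d cells in a case is
    contaminated independently with the same probability eps.
    """
    if not 0.0 <= eps <= 1.0:
        raise ValueError("eps must lie in [0, 1]")
    if d < 1:
        raise ValueError("d must be a positive integer")
    # A case is clean only if all d of its cells are clean.
    return 1.0 - (1.0 - eps) ** d


if __name__ == "__main__":
    # Fix the cell contamination rate at 5% and vary the dimension.
    for d in (5, 14, 20, 100):
        print(d, round(contaminated_case_fraction(0.05, d), 3))
```

Under this independence assumption, a 5% cell contamination rate already touches more than half of the cases once d reaches 14, and roughly 64% of them at d = 20, which is why methods that assume a clean majority of cases run into trouble.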

Abstract

In statistics and machine learning, the traditional meaning of the terms “outlier” and “anomaly” is a case in the dataset that behaves differently from the bulk of the data. This raises suspicion that it may belong to a different population. But nowadays increasing attention is being paid to so-called cellwise outliers. These are individual values somewhere in the data matrix (or data tensor). Depending on the dimension, even a relatively small proportion of outlying cells can contaminate over half the cases, which is a problem for existing casewise methods. It turns out that detecting cellwise outliers as well as constructing cellwise robust methods requires techniques that are quite different from the casewise setting. For instance, one has to let go of some intuitive equivariance properties. The problem is difficult, but the past decade has seen substantial progress. For high-dimensional data the cellwise approach is becoming dominant, and typically can deal with missing values as well. We review developments in the estimation of location and covariance matrices as well as regression methods, principal component analysis, methods for tensor data, and various other settings.
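
To illustrate what flagging individual cells rather than whole cases can look like, here is a deliberately naive baseline: each column is standardized by its median and MAD, and cells with a large robust z-score are flagged. This is not a method taken from the review; it looks at one variable at a time and ignores relations between variables. The function name, the cutoff of 3, and the toy data are all assumptions made for illustration. Missing values are represented as `None` and simply skipped.

```python
from statistics import median


def flag_cells(X, cutoff=3.0):
    """Return sorted (row, col) indices of cells with |robust z-score| > cutoff.

    Naive univariate baseline (an illustration, not the reviewed methods):
    each column is centered by its median and scaled by 1.4826 * MAD.
    Cells equal to None are treated as missing and never flagged.
    """
    flagged = []
    n_cols = len(X[0])
    for j in range(n_cols):
        observed = [row[j] for row in X if row[j] is not None]
        center = median(observed)
        scale = 1.4826 * median([abs(v - center) for v in observed])
        if scale == 0:
            continue  # degenerate column: no usable scale, flag nothing
        for i, row in enumerate(X):
            if row[j] is not None and abs(row[j] - center) / scale > cutoff:
                flagged.append((i, j))
    return sorted(flagged)


# Toy data: two isolated outlying cells and one missing value.
X = [
    [1.0, 10.0, 100.0],
    [1.1, 10.2, 101.0],
    [0.9, 9.8, 99.0],
    [1.2, 50.0, 100.5],   # outlying cell in column 1
    [1.0, 10.1, None],    # missing cell in column 2
    [8.0, 9.9, 99.5],     # outlying cell in column 0
]

if __name__ == "__main__":
    print(flag_cells(X))
```

On this toy matrix the two flagged cells sit in two different rows. A casewise approach would have to discard or downweight both rows entirely, while a cellwise view keeps the remaining cells of those rows available.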