TY - JOUR
T1 - Little data is often enough for distance-based outlier detection
AU - Muhr, David
AU - Affenzeller, Michael
N1 - Publisher Copyright: © 2022 The Authors. Published by Elsevier B.V.
PY - 2022
Y1 - 2022
N2 - Many real-world use cases benefit from fast training and prediction times, and much research went into speeding up distance-based outlier detection methods to millions of data points. Contrary to popular belief, our findings suggest that little data is often enough for distance-based outlier detection models. We show that using only a tiny fraction of the data to train distance-based outlier detection models often leads to no significant reduction in predictive performance and detection variance over a wide range of tabular datasets. Furthermore, we compare a data reduction based on random subsampling and clustering-based prototypes and show that both approaches yield similar outlier detection results. Simple random subsampling, thus, proves to be a useful benchmark and baseline for future research on speeding up distance-based outlier detection.
KW - anomaly detection
KW - clustering
KW - k-means
KW - knn
KW - local outlier factor
KW - lof
KW - nearest neighbors
KW - outlier detection
KW - prototypes
KW - unsupervised
UR - http://www.scopus.com/inward/record.url?scp=85127803636&partnerID=8YFLogxK
U2 - 10.1016/j.procs.2022.01.297
DO - 10.1016/j.procs.2022.01.297
M3 - Conference article
AN - SCOPUS:85127803636
VL - 200
SP - 984
EP - 992
JO - Procedia Computer Science
JF - Procedia Computer Science
T2 - 3rd International Conference on Industry 4.0 and Smart Manufacturing, ISM 2021
Y2 - 19 November 2021 through 21 November 2021
ER -