Abstract
Many real-world use cases benefit from fast training and prediction times, and much research has gone into scaling distance-based outlier detection methods to millions of data points. Contrary to popular belief, our findings suggest that a small amount of data is often enough for distance-based outlier detection models. We show that training such models on only a tiny fraction of the data often leads to no significant reduction in predictive performance and detection variance across a wide range of tabular datasets. Furthermore, we compare data reduction based on random subsampling with clustering-based prototypes and show that both approaches yield similar outlier detection results. Simple random subsampling thus proves to be a useful benchmark and baseline for future research on speeding up distance-based outlier detection.
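The comparison described above could be sketched roughly as follows, assuming scikit-learn's LocalOutlierFactor (in novelty mode) as the distance-based detector and KMeans centroids as the clustering-based prototypes; the synthetic data, the 1% subsample, and the 64 prototypes are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch: train a distance-based outlier detector on (a) a small random
# subsample and (b) k-means prototypes, then compare the resulting outlier
# scores on the full dataset. All concrete settings below are assumptions.
import numpy as np
from scipy.stats import spearmanr
from sklearn.cluster import KMeans
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# Synthetic tabular data: a dense inlier cloud plus a few scattered outliers.
inliers = rng.normal(0.0, 1.0, size=(10_000, 8))
outliers = rng.uniform(-6.0, 6.0, size=(50, 8))
X = np.vstack([inliers, outliers])


def lof_scores(train_X, test_X, k=10):
    """Fit LOF on train_X only and score test_X (higher = more outlying)."""
    lof = LocalOutlierFactor(n_neighbors=k, novelty=True)
    lof.fit(train_X)
    return -lof.score_samples(test_X)


# (a) Random subsampling: keep roughly 1% of the data as the reference set.
sample_idx = rng.choice(len(X), size=len(X) // 100, replace=False)
scores_sample = lof_scores(X[sample_idx], X)

# (b) Clustering-based prototypes: k-means centroids as the reference set.
prototypes = KMeans(n_clusters=64, n_init=10, random_state=0).fit(X).cluster_centers_
scores_proto = lof_scores(prototypes, X)

# Rank agreement between the two data-reduction strategies.
rho, _ = spearmanr(scores_sample, scores_proto)
print(f"Spearman correlation of outlier scores: {rho:.3f}")
```

A rank correlation close to 1 would indicate that the two reduced reference sets produce essentially the same outlier ranking, which is the kind of agreement between random subsampling and clustering-based prototypes that the abstract reports.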
Original language | English |
---|---|
Pages (from-to) | 984-992 |
Number of pages | 9 |
Journal | Procedia Computer Science |
Volume | 200 |
DOIs | |
Publication status | Published - 2022 |
Event | 3rd International Conference on Industry 4.0 and Smart Manufacturing, ISM 2021, Linz, Austria. Duration: 19 Nov 2021 → 21 Nov 2021 |
Keywords
- anomaly detection
- clustering
- k-means
- knn
- local outlier factor
- lof
- nearest neighbors
- outlier detection
- prototypes
- unsupervised