Little data is often enough for distance-based outlier detection

Research output: Contribution to journal › Conference article › peer-review

Abstract

Many real-world use cases benefit from fast training and prediction times, and much research has gone into scaling distance-based outlier detection methods to millions of data points. Contrary to popular belief, our findings suggest that little data is often enough for distance-based outlier detection models. We show that training distance-based outlier detection models on only a tiny fraction of the data often causes no significant degradation in either predictive performance or detection variance across a wide range of tabular datasets. Furthermore, we compare data reduction by random subsampling with data reduction by clustering-based prototypes and show that both approaches yield similar outlier detection results. Simple random subsampling thus proves to be a useful benchmark and baseline for future research on speeding up distance-based outlier detection.
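The idea of fitting a distance-based detector on a small random subsample can be illustrated with a minimal sketch. It is not the authors' code or experimental setup. The synthetic 2-D data, the k-th-nearest-neighbour distance score, the 5% subsample size, and the helper names `knn_score`, `make_data`, and `separation` are all assumptions made for illustration.

```python
import math
import random


def knn_score(point, reference, k):
    """Outlier score: distance from `point` to its k-th nearest reference point."""
    dists = sorted(math.dist(point, r) for r in reference)
    return dists[k - 1]


def make_data(n_inliers, n_outliers, rng):
    """Synthetic 2-D data (an illustrative assumption, not data from the paper)."""
    inliers = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n_inliers)]
    outliers = [(rng.uniform(8, 10), rng.uniform(8, 10)) for _ in range(n_outliers)]
    return inliers, outliers


rng = random.Random(0)
train, _ = make_data(2000, 0, rng)
test_inliers, test_outliers = make_data(200, 20, rng)

# Data reduction by simple random subsampling: keep 5% of the training points.
subsample = rng.sample(train, k=100)


def separation(reference, k=5):
    """Smallest outlier score minus largest inlier score.

    A positive value means every test outlier scores higher than every test inlier.
    """
    in_scores = [knn_score(p, reference, k) for p in test_inliers]
    out_scores = [knn_score(p, reference, k) for p in test_outliers]
    return min(out_scores) - max(in_scores)


full_gap = separation(train)
sub_gap = separation(subsample)
print(f"separation with all {len(train)} training points: {full_gap:.2f}")
print(f"separation with {len(subsample)} subsampled points: {sub_gap:.2f}")
```

On this deliberately easy toy problem, both reference sets separate the test outliers from the test inliers, while scoring against the subsample needs a twentieth of the distance computations. Real tabular datasets are harder, so the sketch only shows the mechanics, not the paper's empirical result.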

Original language: English
Pages (from-to): 984-992
Number of pages: 9
Journal: Procedia Computer Science
Volume: 200
DOIs
Publication status: Published - 2022
Event: 3rd International Conference on Industry 4.0 and Smart Manufacturing, ISM 2021 - Linz, Austria
Duration: 19 Nov 2021 to 21 Nov 2021

Keywords

  • anomaly detection
  • clustering
  • k-means
  • knn
  • local outlier factor
  • lof
  • nearest neighbors
  • outlier detection
  • prototypes
  • unsupervised
