Abstract
The scores of distance-based outlier detection methods are difficult to interpret, and it is challenging to determine a suitable cut-off threshold between normal and outlier data points without additional context. We describe a generic transformation of distance-based outlier scores into interpretable, probabilistic estimates. The transformation is ranking-stable and increases the contrast between normal and outlier data points. Computing pairwise distances is necessary to identify nearest-neighbor relationships in the data, yet most of the computed distances are typically discarded. We show that these distances to other data points can be used to model distance probability distributions and, subsequently, that these distributions can turn distance-based outlier scores into outlier probabilities. Across a variety of tabular and image benchmark datasets, we show that the probabilistic transformation does not impact outlier ranking (ROC AUC) or detection performance (AP, F1), while it increases the contrast between normal and outlier score distributions (statistical distance). The experimental findings indicate that it is possible to transform distance-based outlier scores into interpretable probabilities with increased contrast between normal and outlier samples. Our work generalizes to a wide range of distance-based outlier detection methods, and, because existing distance computations are reused, it adds no significant computational overhead.
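The idea in the abstract can be sketched in a few lines. The following is an illustrative example, not the paper's exact method: it computes the classic distance-to-k-th-nearest-neighbor outlier score and then maps scores to probabilities through a Gaussian CDF fitted to the score distribution. The Gaussian model, the value of `k`, and the synthetic data are assumptions for demonstration; any strictly monotone CDF preserves the outlier ranking, which is the property the abstract claims.

```python
import math

import numpy as np

rng = np.random.default_rng(0)
# Synthetic data (assumed for illustration): a normal cluster plus 5 outliers.
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(8.0, 1.0, size=(5, 2))])

# Pairwise distances; these are computed anyway to find nearest neighbors,
# which is the "discarded distances" the abstract proposes to reuse.
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
np.fill_diagonal(D, np.inf)  # exclude self-distances

k = 5
# Classic distance-based outlier score: distance to the k-th nearest neighbor.
scores = np.sort(D, axis=1)[:, k - 1]

# Illustrative probabilistic transformation (a simple stand-in for the paper's
# distance-distribution model): fit a Gaussian to the scores and map each
# score through its CDF, yielding a value in [0, 1].
mu, sigma = scores.mean(), scores.std()
probs = np.array([0.5 * (1.0 + math.erf((s - mu) / (sigma * math.sqrt(2.0))))
                  for s in scores])

# The CDF mapping is monotone, so sorting by probability agrees with sorting
# by raw score: the ranking (and hence ROC AUC) is unchanged.
order = np.argsort(scores)
assert np.all(np.diff(probs[order]) >= 0.0)
print(f"mean normal prob: {probs[:200].mean():.2f}, "
      f"mean outlier prob: {probs[200:].mean():.2f}")
```

In this sketch the planted outliers receive probabilities near 1 while the bulk of the normal points sit well below, illustrating the increased contrast between the two score distributions that the paper measures with statistical distance.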
Original language | English |
---|---|
Pages (from-to) | 782-802 |
Number of pages | 21 |
Journal | Machine Learning and Knowledge Extraction |
Volume | 5 |
Issue number | 3 |
DOIs | |
Publication status | Published - Sept 2023 |
Keywords
- anomaly detection
- anomaly scores
- distance distribution
- novelty detection
- outlier detection
- outlier probabilities
- outlier scores
- score contrast
- score distribution
- score normalization