Data Validation Utilizing Expert Knowledge and Shape Constraints

Florian Bachinger, Lisa Ehrlinger, Gabriel Kronberger, Wolfram Wöss

Research output: Contribution to journal › Article › peer-review

Abstract

Data validation is a primary concern in any data-driven application, as undetected data errors may negatively affect machine learning models and lead to suboptimal decisions. Data quality issues are usually detected manually by experts, which becomes infeasible and uneconomical for large volumes of data. To enable automated data validation, we propose “shape constraint-based data validation,” a novel approach based on machine learning that incorporates expert knowledge in the form of shape constraints. Shape constraints can be used to describe expected (multivariate and nonlinear) patterns in valid data and enable the detection of invalid data that deviates from these expected patterns. Our approach can be divided into two steps: (1) shape-constrained prediction models are trained on data, and (2) their training error is analyzed to identify invalid data. The training error can be used as an indicator for invalid data because shape-constrained models can fit valid data better than invalid data. We evaluate the approach on a benchmark suite consisting of synthetic datasets, which we have published for benchmarking similar data validation approaches. Additionally, we demonstrate the capabilities of the proposed approach with a real-world dataset consisting of measurements from a friction test bench in an industrial setting. Our approach detects subtle data errors that are difficult to identify even for domain experts.
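The two steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the model class, so a univariate monotonicity constraint fitted with the pool-adjacent-violators algorithm stands in for the (multivariate, nonlinear) shape-constrained models. The function names, the example series, and the error threshold are all hypothetical and chosen only for illustration.

```python
import math


def fit_monotone_increasing(y):
    """Least-squares fit of a non-decreasing sequence to y (pool adjacent violators).

    Stand-in for a shape-constrained model; the only constraint is monotonicity.
    """
    blocks = []  # each block holds [sum of values, number of values]
    for value in y:
        blocks.append([value, 1])
        # Merge blocks while the previous block's mean exceeds the latest one's.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted


def training_rmse(y):
    """Step 1 and 2 combined: fit the constrained model, return its training error."""
    fitted = fit_monotone_increasing(y)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, fitted)) / len(y))


def looks_valid(y, threshold=0.5):
    """Flag a series as valid if the constrained fit has a low training error.

    The threshold is a hypothetical value; in practice it would be calibrated
    on data known to be valid.
    """
    return training_rmse(y) <= threshold


# Made-up measurements that should increase with the input.
valid_series = [1.0, 2.1, 2.0, 3.2, 4.1, 5.0, 6.2]    # small noise only
invalid_series = [1.0, 2.1, 2.0, 9.0, 4.1, 5.0, 6.2]  # one corrupted reading

for name, series in [("valid", valid_series), ("invalid", invalid_series)]:
    print(name, round(training_rmse(series), 3), looks_valid(series))
```

The constrained model cannot follow the corrupted reading without violating monotonicity, so its training error on the second series is markedly higher. This is the effect the approach relies on to separate valid from invalid data.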

Original language: English
Article number: 13
Journal: Journal of Data and Information Quality
Volume: 16
Issue number: 2
Publication status: Published - 25 Jun 2024

Keywords

  • Data quality
  • expert knowledge integration
  • machine learning
  • shape-constraints
