TY - GEN
T1 - Effects of data grouping on calibration measures of classifiers
AU - Dreiseitl, Stephan
AU - Osl, Melanie
PY - 2012
Y1 - 2012
N2 - The calibration of a probabilistic classifier refers to the extent to which its probability estimates match the true class membership probabilities. Measuring the calibration of a classifier usually relies on performing chi-squared goodness-of-fit tests between grouped probabilities and the observations in these groups. We considered alternatives to the Hosmer-Lemeshow test, the standard chi-squared test with groups based on sorted model outputs. Since this grouping does not represent "natural" groupings in data space, we investigated a chi-squared test with grouping strategies in data space. Using a series of artificial data sets for which the correct models are known, and one real-world data set, we analyzed the performance of the Pigeon-Heyse test with groupings by self-organizing maps, k-means clustering, and random assignment of points to groups. We observed that the Pigeon-Heyse test offers slightly better performance than the Hosmer-Lemeshow test while being able to locate regions of poor calibration in data space.
AB - The calibration of a probabilistic classifier refers to the extent to which its probability estimates match the true class membership probabilities. Measuring the calibration of a classifier usually relies on performing chi-squared goodness-of-fit tests between grouped probabilities and the observations in these groups. We considered alternatives to the Hosmer-Lemeshow test, the standard chi-squared test with groups based on sorted model outputs. Since this grouping does not represent "natural" groupings in data space, we investigated a chi-squared test with grouping strategies in data space. Using a series of artificial data sets for which the correct models are known, and one real-world data set, we analyzed the performance of the Pigeon-Heyse test with groupings by self-organizing maps, k-means clustering, and random assignment of points to groups. We observed that the Pigeon-Heyse test offers slightly better performance than the Hosmer-Lemeshow test while being able to locate regions of poor calibration in data space.
KW - Classifier calibration
KW - Hosmer-Lemeshow test
KW - Pigeon-Heyse test
KW - goodness-of-fit tests
UR - http://www.scopus.com/inward/record.url?scp=84856910538&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-27549-4_46
DO - 10.1007/978-3-642-27549-4_46
M3 - Conference contribution
SN - 9783642275487
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 359
EP - 366
BT - Computer Aided Systems Theory, EUROCAST 2011 - 13th International Conference, Revised Selected Papers
T2 - 13th International Conference on Computer Aided Systems Theory EUROCAST 2011
Y2 - 6 February 2011 through 11 February 2011
ER -