Estimating precisions for multiple binary classifiers under limited samples
2020
Machine learning classifiers often require regular tracking of performance measures such as precision, recall, and F1-score for model improvement and diagnostics. The population over which these metrics are evaluated can be too large for a full ground-truth assessment, so only small random samples are drawn for estimation. Ground-truthing often requires human review, which is expensive. Moreover, in some business applications it may be preferable to minimize human contact with the data in order to strengthen privacy safeguards. Thus, sampling methods that provide estimates with a low margin of error, high confidence, and a small sample size are highly desirable. With an ensemble of multiple binary classifiers, choosing a sampling method with these properties while keeping the collective sample small becomes even more important. We propose a sampling method for estimating the precisions of multiple binary classifiers that exploits the overlaps between their prediction sets. We provide theoretical guarantees that our estimators are unbiased and empirically demonstrate that the precision metrics estimated with our sampling technique are as good (in terms of variance and confidence interval) as those obtained from a uniform random sample. We applied our sampling technique to the performance evaluation of an ensemble of binary classifiers. The reduction in sample size depends on the extent of overlap between the predicted positive set of the ensemble and that of each individual classifier. Since we do not have a closed-form solution quantifying the impact of the overlap, we relied on simulations to investigate how the overlap between an ensemble (parent) classifier and a component (child) classifier affects the overall sample size. For every combination of parent and child intersection ratios we tested, we found significant savings in sample size.
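To make the trade-off between margin of error, confidence, and sample size concrete, the sketch below uses the standard sample-size formula for estimating a proportion such as precision from a uniform random sample: n0 = z^2 p (1 - p) / E^2, where E is the margin of error and z the two-sided critical value, optionally shrunk by the finite population correction n = n0 / (1 + (n0 - 1) / N) when the predicted positive set holds only N items. This is a generic uniform-sampling baseline under the normal approximation, not the paper's method; `required_sample_size` is a hypothetical helper, and p = 0.5 is the conservative default when the precision is unknown.

```python
import math
from statistics import NormalDist
from typing import Optional


def required_sample_size(margin: float, confidence: float = 0.95,
                         p: float = 0.5,
                         population: Optional[int] = None) -> int:
    """Hypothetical helper: labels needed so that a uniform-sample estimate
    of a proportion (e.g. precision) has half-width `margin` at the given
    confidence, using the normal approximation."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided critical value
    n0 = z * z * p * (1 - p) / (margin * margin)   # infinite-population size
    if population is not None:
        # Finite population correction for sampling without replacement
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)


if __name__ == "__main__":
    print(required_sample_size(0.05))                   # 385
    print(required_sample_size(0.05, population=1000))  # 278
```

For a 5% margin at 95% confidence this gives 385 labels, or 278 when the predicted positive set holds only 1,000 items. If every classifier in an ensemble were evaluated independently this way, each would need a sample of roughly this size, which is the collective cost that overlap-aware sampling targets.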
Across all of our simulations, we found a mean reduction of 33% in the sample size needed from a child. Our simulations also confirm that the precision metrics estimated from samples generated with our sampling technique have accuracy comparable to those estimated from uniform random sampling.
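The abstract does not spell out the estimator, so the following is only a hypothetical illustration of the general idea of reusing labels in the overlap between a parent and a child, with a small simulation on synthetic ground truth; it is not claimed to be the authors' method and does not reproduce the 33% figure. Assumptions: the parent's labels come from a uniform sample without replacement of its predicted positive set; labels that fall in the overlap are reused, since conditional on how many land there they form a uniform sample of the overlap; fresh labels are drawn only from the child's items outside the parent's set; and the two strata are combined with weights equal to their shares of the child's predicted set, which keeps the estimate unbiased for the child's precision as long as both strata are nonempty. `reuse_estimate`, `simulate`, and all sizes and rates are hypothetical.

```python
import random
from statistics import mean, pstdev


def reuse_estimate(labels, parent, child, parent_sample, n_new, rng):
    """Hypothetical overlap-reuse estimate of the child's precision.

    Labels already collected for the parent that fall inside the child's
    predicted positive set are reused; fresh labels are drawn only from the
    child's items outside the parent's set. The two strata are combined as a
    weighted (stratified) mean, with weights equal to the strata's shares of
    the child's predicted positive set.
    """
    overlap = child & parent
    rest = sorted(child - parent)
    reused = [labels[i] for i in parent_sample if i in overlap]
    fresh = [labels[i] for i in rng.sample(rest, n_new)]
    # Assumes at least one reused label and n_new >= 1; otherwise a stratum
    # has no estimate and mean() raises StatisticsError.
    w = len(overlap) / len(child)  # weight of the overlap stratum
    return w * mean(reused) + (1 - w) * mean(fresh)


def simulate(trials=1000, seed=0):
    """Compare uniform sampling of the child with the reuse estimator on
    synthetic ground truth (all sizes and rates are illustrative)."""
    rng = random.Random(seed)
    n_items = 10_000
    labels = [1 if rng.random() < 0.7 else 0 for _ in range(n_items)]
    parent = set(range(0, 6000))     # parent's predicted positives
    child = set(range(3000, 7000))   # 3/4 of the child lies inside the parent
    true_prec = mean(labels[i] for i in child)

    n_parent, n_child = 400, 400     # label budgets under uniform sampling
    rest_share = len(child - parent) / len(child)
    n_new = round(n_child * rest_share)  # proportional share for the new stratum
    parent_list, child_list = sorted(parent), sorted(child)

    uniform, reuse = [], []
    for _ in range(trials):
        parent_sample = rng.sample(parent_list, n_parent)
        child_sample = rng.sample(child_list, n_child)
        uniform.append(mean(labels[i] for i in child_sample))
        reuse.append(reuse_estimate(labels, parent, child, parent_sample, n_new, rng))
    return true_prec, uniform, reuse, n_child - n_new


if __name__ == "__main__":
    true_prec, uniform, reuse, saved = simulate()
    print(f"true child precision   {true_prec:.4f}")
    print(f"uniform  mean / std    {mean(uniform):.4f} / {pstdev(uniform):.4f}")
    print(f"reuse    mean / std    {mean(reuse):.4f} / {pstdev(reuse):.4f}")
    print(f"fresh child labels saved per trial: {saved}")
```

In this configuration the child needs 100 fresh labels instead of 400, because about 200 reused parent labels, in expectation, cover the overlap stratum. Comparing the printed means with the true precision checks unbiasedness, and comparing the standard deviations shows the variance cost of this allocation; raising `n_new` trades some of the label savings for lower variance.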