The effectiveness of automatic evaluation of generative models is typically measured by comparing automatically generated labels with human labels using correlation metrics. However, metrics such as Krippendorff’s α and Randolph’s κ were originally designed to measure the reliability of human labeling; they therefore make assumptions about typical human labeling behavior, and these assumptions may not be applicable