Prune then distill: Dataset distillation with importance sampling
2023
The development of large datasets for various tasks has driven the success of deep learning models, but at the cost of increased label noise, duplication, collection challenges, and storage and training requirements. In this work, we investigate whether all samples in large datasets contribute equally to model accuracy. We study statistical and mathematical techniques that reduce redundancy in datasets by directly optimizing data samples for the generalization accuracy of deep learning models. Existing dataset optimization approaches include analytic methods that remove unimportant samples and synthetic methods that generate new datasets to maximize generalization accuracy. We develop Prune then distill, a combination of analytic and synthetic dataset optimization algorithms, and demonstrate up to 15% relative improvement in generalization accuracy over either approach used independently on standard image and audio classification tasks. Additionally, we demonstrate up to 38% improvement in the generalization accuracy of dataset pruning algorithms by maintaining class balance while pruning.
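To make the class-balanced pruning step concrete, the sketch below prunes a dataset by per-sample importance scores while keeping the same fraction of samples in every class, then hands the pruned subset off to a (placeholder) distillation stage. It is a minimal illustration only: the scoring criterion, keep fraction, and function names are assumptions, not the paper's exact algorithm.

```python
# Minimal sketch of a "prune then distill" pipeline (NumPy only).
# The importance scores and keep_fraction here are illustrative assumptions.
import numpy as np


def class_balanced_prune(scores, labels, keep_fraction):
    """Keep the top-scoring `keep_fraction` of samples within each class,
    so the pruned subset preserves the original class balance."""
    keep_idx = []
    for c in np.unique(labels):
        class_idx = np.where(labels == c)[0]
        n_keep = max(1, int(round(keep_fraction * len(class_idx))))
        # Rank this class's samples by importance score, highest first.
        ranked = class_idx[np.argsort(scores[class_idx])[::-1]]
        keep_idx.append(ranked[:n_keep])
    return np.sort(np.concatenate(keep_idx))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_samples, n_classes = 1000, 10
    labels = rng.integers(0, n_classes, size=n_samples)
    # Placeholder importance scores; in practice these would come from an
    # analytic pruning criterion (e.g., a trained model's per-sample loss).
    scores = rng.random(n_samples)

    kept = class_balanced_prune(scores, labels, keep_fraction=0.3)
    print(f"Kept {len(kept)} of {n_samples} samples (class-balanced)")
    # The pruned subset would then be passed to a dataset distillation
    # routine that synthesizes a small set of samples per class.
```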