Tabular out-of-distribution data synthesis for enhancing robustness
2024
Many critical machine learning applications in cybersecurity, healthcare, and finance encounter challenges such as data privacy, distribution shifts, and class imbalance. Often, minority-class labels are scarce and may be present only for specific types of samples, which makes it difficult to develop models that handle new and unforeseen minority examples at inference time. Additionally, feeding sensitive data into downstream models raises significant privacy concerns. Synthetic data generation offers a potential solution: it preserves data privacy, creates samples to rebalance the classes, and provides a way to generate out-of-distribution samples. We introduce TabOOD, a novel approach that generates synthetic tabular data to enhance robustness against unseen data and distribution shifts. TabOOD generates out-of-distribution samples that can augment the training set, simulating unobserved scenarios and improving downstream model robustness. It also allows conditional generation of in-distribution minority- and majority-class samples. Building on recent advances in tabular data synthesis with latent diffusion models, our approach maps tabular data to class-dependent Gaussian mixture components in a latent space, thereby separating the latent representations, before training diffusion models on that latent space. We further manipulate the latent space to generate atypical, boundary data points. Experimental results across diverse datasets demonstrate that TabOOD significantly improves the performance of downstream models faced with distribution shifts or novel out-of-distribution samples, offering a more balanced and robust approach to tabular data learning.
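To make the core idea concrete, the following is a minimal, hypothetical sketch of the latent-space mechanics described above: each class occupies its own Gaussian component in a shared latent space, in-distribution samples are drawn conditionally from a class component, and out-of-distribution "boundary" latents are produced by interpolating between component means. All names (`mu`, `sigma`, `sample_boundary`) and the interpolation range are illustrative assumptions, not the paper's actual implementation; the encoder, decoder, and latent diffusion model are omitted.

```python
import numpy as np

# Hypothetical illustration of class-dependent Gaussian mixture components
# in a latent space, plus mean interpolation to produce atypical boundary
# latents. Not the authors' implementation.

rng = np.random.default_rng(0)
latent_dim = 8

# Assume an encoder has mapped each class to a separated Gaussian component:
# class 0 (majority) and class 1 (minority).
mu = {0: np.zeros(latent_dim), 1: np.full(latent_dim, 3.0)}
sigma = 1.0

def sample_in_distribution(label: int, n: int) -> np.ndarray:
    """Conditional in-distribution latents: draw from the class component."""
    return rng.normal(mu[label], sigma, size=(n, latent_dim))

def sample_boundary(n: int, low: float = 0.4, high: float = 0.6) -> np.ndarray:
    """Out-of-distribution latents: interpolate between the class means so
    samples land in the low-density region between the two components."""
    alpha = rng.uniform(low, high, size=(n, 1))
    center = (1 - alpha) * mu[0] + alpha * mu[1]
    return rng.normal(center, sigma * 0.5, size=(n, latent_dim))

minority = sample_in_distribution(1, 256)   # rebalance the minority class
boundary = sample_boundary(128)             # simulate unobserved scenarios
# A decoder (not shown) would map these latents back to tabular rows; in the
# paper's pipeline, diffusion models trained on the latent space generate
# the latents rather than this direct Gaussian sampling.
```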