Self-supervision improves diffusion models for tabular data imputation
2024
Incomplete tabular datasets are ubiquitous across applications, for reasons such as human error in data collection or privacy considerations. A natural solution is to use powerful generative models such as diffusion models, which have demonstrated great potential across image and continuous domains. However, vanilla diffusion models are often sensitive to the noise from which sampling is initialized. This sensitivity, together with the sparsity inherent in tabular data, undermines the robustness of diffusion models for data imputation. In this work, we propose the Self-supervised imputation Diffusion Model (SimpDM for brevity), a diffusion model specifically tailored to tabular data imputation. To mitigate sensitivity to noise, we introduce a self-supervised alignment mechanism that regularizes the model to produce consistent and stable imputation predictions. Furthermore, we introduce a carefully devised state-dependent data augmentation strategy within SimpDM, which improves the robustness of the diffusion model when data is limited. Extensive experiments demonstrate that SimpDM matches or outperforms state-of-the-art imputation methods across various scenarios.
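The idea behind the alignment mechanism can be illustrated with a minimal sketch. This is not the SimpDM implementation: it only assumes that "alignment" means penalizing disagreement between imputations obtained from two independent noise draws for the same row. The names `add_noise`, `toy_denoiser`, and `alignment_loss` are hypothetical, and the denoiser is a fixed toy function standing in for a trained network.

```python
import math
import random


def add_noise(x, alpha_bar, rng):
    """Standard forward diffusion: x_t = sqrt(a) * x + sqrt(1 - a) * eps, eps ~ N(0, 1)."""
    scale_x = math.sqrt(alpha_bar)
    scale_eps = math.sqrt(1.0 - alpha_bar)
    return [scale_x * v + scale_eps * rng.gauss(0.0, 1.0) for v in x]


def toy_denoiser(x_t, observed, mask, weight=0.5):
    """Hypothetical stand-in for a trained network.

    Observed entries (mask True) are copied through; missing entries are
    filled with a shrunken version of the noisy value.
    """
    return [o if m else weight * v for v, o, m in zip(x_t, observed, mask)]


def alignment_loss(row, mask, alpha_bar, rng, denoiser=toy_denoiser):
    """Mean squared gap between two imputations from independent noise draws.

    Only missing entries (mask False) contribute, since those are the
    values the model has to impute.
    """
    pred_a = denoiser(add_noise(row, alpha_bar, rng), row, mask)
    pred_b = denoiser(add_noise(row, alpha_bar, rng), row, mask)
    missing = [i for i, m in enumerate(mask) if not m]
    if not missing:
        return 0.0
    return sum((pred_a[i] - pred_b[i]) ** 2 for i in missing) / len(missing)


# Assumed toy row: the third feature is missing and holds a placeholder of 0.0.
rng = random.Random(0)
row = [0.2, 1.5, 0.0, -0.7]
mask = [True, True, False, True]
loss = alignment_loss(row, mask, alpha_bar=0.5, rng=rng)
```

In a training loop, a term like this would presumably be added to the usual denoising objective with a weighting coefficient, so that the model is pushed toward imputations that do not depend on the particular noise sample.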