DEFT-VTON: Efficient virtual try-on with consistent generalised h-transform
2025
Diffusion models enable high-quality virtual try-on (VTO) thanks to their established image-synthesis capabilities. Current VTO methods typically rely on extensive end-to-end training of large pre-trained models, yet real-world applications often operate under tight training and inference/serving/deployment budgets. To address this, we apply Doob's h-transform efficient fine-tuning (DEFT) to adapt large pre-trained unconditional models to the downstream image-conditioned VTO task. DEFT freezes the pre-trained model's parameters and trains a small network to learn the conditional h-transform; this network amounts to only 1.42% of the frozen model's parameter count, compared to 5.52% for a traditional parameter-efficient fine-tuning (PEFT) baseline. To further improve DEFT's performance and reduce the inference time of existing models, we additionally propose an adaptive consistency loss. Consistency training distills a slow but strong diffusion model into a fast one while retaining performance by enforcing consistency along the inference path. Inspired by constrained optimization, we instead combine the consistency loss with the denoising score matching loss in a data-adaptive manner, rather than via distillation, enabling low-cost fine-tuning of existing VTO models. Empirical results show that the proposed DEFT-VTON method achieves state-of-the-art performance on VTO tasks, and maintains competitive performance with as few as 15 function evaluations (denoising steps).
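The core idea can be sketched as follows: the pre-trained unconditional score network stays frozen, a small trainable network adds a conditional h-transform drift on top of it, and the training objective mixes a denoising score matching term with a consistency term. This is a minimal toy sketch, not the paper's architecture or training recipe: `ScoreNet`, `HTransformNet`, `deft_loss`, the simplified forward perturbation, and the fixed weight `lam` (standing in for the paper's data-adaptive weighting) are all illustrative assumptions.

```python
# Minimal DEFT-style sketch (assumptions: toy MLPs, a simplified forward
# process, and a fixed consistency weight in place of the adaptive one).
import torch
import torch.nn as nn

class ScoreNet(nn.Module):
    """Stand-in for a large pre-trained unconditional diffusion model."""
    def __init__(self, dim=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 64), nn.SiLU(),
                                 nn.Linear(64, dim))
    def forward(self, x, t):
        return self.net(torch.cat([x, t], dim=-1))

class HTransformNet(nn.Module):
    """Small trainable network approximating the conditional h-transform drift."""
    def __init__(self, dim=8, cond_dim=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + cond_dim + 1, 16), nn.SiLU(),
                                 nn.Linear(16, dim))
    def forward(self, x, cond, t):
        return self.net(torch.cat([x, cond, t], dim=-1))

def deft_loss(frozen_score, h_net, x0, cond, lam=0.5):
    """Denoising score matching + consistency term, weighted by lam."""
    b, d = x0.shape
    t = torch.rand(b, 1)
    noise = torch.randn_like(x0)
    xt = x0 + t * noise                       # toy forward perturbation
    with torch.no_grad():                     # pre-trained model stays frozen
        s_uncond = frozen_score(xt, t)
    s_cond = s_uncond + h_net(xt, cond, t)    # h-transform adds conditional drift
    dsm = ((s_cond + noise) ** 2).mean()      # denoising score matching surrogate
    # Consistency: the conditional prediction at a nearby time should agree
    # with a stop-gradient target along the same trajectory.
    t2 = t * 0.9
    xt2 = x0 + t2 * noise
    with torch.no_grad():
        target = frozen_score(xt2, t2) + h_net(xt2, cond, t2)
    cons = ((s_cond - target) ** 2).mean()
    return dsm + lam * cons
```

Under this setup only `HTransformNet`'s parameters receive gradients, which is what keeps the trainable fraction small relative to the frozen backbone.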