MEDAL: Multi-modal meta-space distillation and alignment for visual compatibility learning
2026
Visual compatibility recommendation systems aim to surface compatible items (e.g., pants, shoes) that harmonise with a user-selected product (e.g., a shirt). Existing methods struggle in three key aspects: they rely on global CNN representations that overlook fine-grained local cues critical for visual pairing; they force all categories into a single latent space, ignoring the fact that compatibility rules differ across product-type pairs; and they demand costly, expert-annotated outfit labels. We introduce MEDAL (Meta-space Distillation and Alignment), a self-supervised framework that addresses all three challenges simultaneously. MEDAL (i) employs a local–global augmentation curriculum inside a teacher–student ViT to emphasise patch-level texture and pattern similarities while suppressing confounding global shape cues; (ii) partitions the joint feature manifold into learnable, pair-specific meta-spaces so that, for example, {shirt, pants} and {pants, shoes} relationships are modelled with distinct projection masks; and (iii) replaces manual labels with distantly supervised knowledge distillation, harvesting pseudo-compatible sets via object detection on web images and thus scaling to millions of real-world examples. We further fuse perceptually uniform LUV colour histograms to capture global colour harmony often missed by pure vision transformers. Extensive experiments on Polyvore disjoint/non-disjoint and a 2M-image in-house dataset show state-of-the-art gains of up to +3.72/+2.7 FITB and +9.58 R@10 over the strongest baseline, whilst cutting annotation cost to zero. Qualitative studies confirm that MEDAL retrieves stylistically coherent outfits and correctly penalises mismatched colour palettes.
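To make the pair-specific meta-space idea more concrete, the sketch below shows one plausible reading of point (ii): a learnable mask per category pair that gates which embedding dimensions contribute to the compatibility score, so {shirt, pants} and {pants, shoes} are compared in distinct sub-spaces. This is a minimal illustration, not the paper's implementation; the class name `PairMetaSpace`, the sigmoid gating, and the cosine-similarity scoring are all assumptions, and the embeddings would in practice come from the teacher–student ViT (optionally fused with LUV colour histograms) rather than random tensors.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PairMetaSpace(nn.Module):
    """Illustrative sketch: one learnable gating mask per category pair,
    so each product-type pair is scored in its own sub-space of the
    shared embedding manifold."""

    def __init__(self, embed_dim: int, category_pairs: list):
        super().__init__()
        self.pair_index = {pair: i for i, pair in enumerate(category_pairs)}
        # One mask per pair; sigmoid keeps gate values in (0, 1).
        self.masks = nn.Parameter(torch.zeros(len(category_pairs), embed_dim))

    def forward(self, anchor: torch.Tensor, candidate: torch.Tensor, pair) -> torch.Tensor:
        mask = torch.sigmoid(self.masks[self.pair_index[pair]])
        # Compare the two items only along the dimensions this pair's mask selects.
        a = F.normalize(anchor * mask, dim=-1)
        c = F.normalize(candidate * mask, dim=-1)
        return (a * c).sum(dim=-1)  # cosine similarity in the pair's meta-space


# Hypothetical usage: embeddings stand in for ViT (+ colour histogram) features.
meta = PairMetaSpace(embed_dim=768,
                     category_pairs=[("shirt", "pants"), ("pants", "shoes")])
shirt_emb = torch.randn(4, 768)   # batch of 4 anchor (shirt) embeddings
pants_emb = torch.randn(4, 768)   # batch of 4 candidate (pants) embeddings
scores = meta(shirt_emb, pants_emb, ("shirt", "pants"))
print(scores.shape)  # torch.Size([4])
```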
Research areas