KG-FLIP: Knowledge-guided fashion-domain language-image pre-training for e-commerce
2023
Various Vision-Language Pre-training (VLP) models (e.g., CLIP, BLIP) have sprung up and dramatically improved the benchmarks of public general-domain datasets (e.g., COCO, Flickr30k). Such models typically learn cross-modal alignment from large-scale, well-aligned image-text datasets. Adapting these models to downstream applications in specific domains, such as fashion, requires fine-grained in-domain image-text datasets. However, such datasets are usually less semantically aligned and smaller in scale, which calls for more efficient pre-training strategies. In this paper, we propose a knowledge-guided fashion-domain language-image pre-training (KG-FLIP) framework that focuses on learning fine-grained representations in the e-commerce domain and utilizes external knowledge (i.e., the product attribute schema) to improve pre-training efficiency. Experimental results demonstrate that KG-FLIP outperforms previous state-of-the-art VLP models on Amazon data and the Fashion-Gen dataset by large margins. KG-FLIP has been successfully deployed in the Amazon catalog system to backfill missing attributes and improve the customer shopping experience.
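To make the knowledge-guidance idea concrete, below is a minimal, hypothetical sketch of one way an attribute schema could steer pre-training: tokens that match known attribute values are masked at a higher rate than ordinary tokens, so the masked-modeling objective spends more of its budget on attribute-bearing words. The schema format, function name, and masking rates here are illustrative assumptions, not the paper's actual implementation.

```python
import random

# Hypothetical product-attribute schema: attribute name -> known values.
# KG-FLIP uses the catalog attribute schema as external knowledge; this
# flat dictionary format is an assumption made for illustration.
ATTRIBUTE_SCHEMA = {
    "sleeve_type": {"long sleeve", "short sleeve", "sleeveless"},
    "neckline": {"v-neck", "crew neck", "halter"},
    "material": {"cotton", "linen", "polyester"},
}

MASK_TOKEN = "[MASK]"


def knowledge_guided_mask(tokens, schema, attr_prob=0.5, base_prob=0.15, seed=None):
    """Mask attribute-value tokens more aggressively than ordinary tokens.

    Tokens appearing in any schema value are masked with probability
    `attr_prob`; all other tokens fall back to the standard `base_prob`
    masked-language-modeling rate. Both rates are illustrative.
    """
    rng = random.Random(seed)
    # Flatten schema values into a set of single-word knowledge tokens.
    knowledge_words = {
        w for values in schema.values() for v in values for w in v.split()
    }
    masked, labels = [], []
    for tok in tokens:
        p = attr_prob if tok.lower() in knowledge_words else base_prob
        if rng.random() < p:
            masked.append(MASK_TOKEN)
            labels.append(tok)   # reconstruction target for the MLM loss
        else:
            masked.append(tok)
            labels.append(None)  # position ignored by the MLM loss
    return masked, labels


tokens = "women 's long sleeve cotton v-neck blouse".split()
masked, labels = knowledge_guided_mask(tokens, ATTRIBUTE_SCHEMA, seed=0)
print(masked)  # attribute words like "cotton" are masked preferentially
```

Under this sketch, the model is pushed to recover attribute values from image context, which aligns with the deployed use case of backfilling missing catalog attributes.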