Multimodal learning with online text cleaning for e-commerce product search
2024
Vision-language transformer models play a pivotal role in e-commerce product search. When such models are trained on pairs of product descriptions (e.g., product titles) and product images, the descriptions often contain text attributes that do not describe visual content, which makes visual-textual alignment challenging. We introduce MultiModal Learning with online Token Pruning (MML-TP). MML-TP leverages token pruning, conventionally used for computational efficiency, to perform online text cleaning during multimodal model training. Evaluation on an e-commerce dataset comprising over 710k unique Amazon products validates that refining text tokens enhances the training of the paired image branch, which leads to significantly improved visual search performance.
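To make the idea of pruning text tokens as a form of cleaning more concrete, the following is a minimal sketch and not the MML-TP implementation. The abstract does not say how tokens are scored, so the sketch assumes a simple stand-in: each token is ranked by the cosine similarity between its embedding and the image embedding, and only the top fraction is kept. The names `prune_tokens`, `keep_ratio`, and the toy 2-D vectors are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def prune_tokens(tokens, token_vecs, image_vec, keep_ratio=0.5):
    """Keep the tokens whose embeddings are most similar to the image embedding.

    ASSUMPTION: cosine similarity to the image embedding is used as the
    relevance score; the scoring used by MML-TP is not given in the abstract.
    Surviving tokens are returned in their original order.
    """
    if len(tokens) != len(token_vecs):
        raise ValueError("tokens and token_vecs must have the same length")
    if not tokens:
        return []
    n_keep = max(1, math.ceil(len(tokens) * keep_ratio))
    scores = [cosine(vec, image_vec) for vec in token_vecs]
    # Indices of the highest-scoring tokens, then restored to original order.
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:n_keep]
    return [tokens[i] for i in sorted(top)]


# Toy example: hypothetical 2-D embeddings where the first axis stands for
# "visually descriptive" content and the second for non-visual attributes.
title_tokens = ["red", "dress", "free", "shipping"]
title_vecs = [[0.9, 0.1], [1.0, 0.0], [0.0, 1.0], [0.1, 0.9]]
image_vec = [1.0, 0.0]

cleaned = prune_tokens(title_tokens, title_vecs, image_vec, keep_ratio=0.5)
print(cleaned)
```

In this toy setting the non-visual tokens ("free", "shipping") score low and are dropped. Since the abstract describes the cleaning as happening online during training, a step like this would presumably run on each batch as it is processed, not as a one-off preprocessing pass over the dataset.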