Multimodal learning with online text cleaning for e-commerce product search

Zhizhang Hu; Shasha Li; Ming Du; Arnab Dhua; Douglas Gray


2024


Vision-language transformer models play a pivotal role in e-commerce product search. Such models are typically trained on pairs of a product description (e.g., the product title) and a product image, but the descriptions often contain text attributes that do not describe the product's visual appearance, which makes visual-textual alignment challenging. We introduce MultiModal Learning with online Token Pruning (MML-TP). MML-TP leverages token pruning, conventionally used for computational efficiency, to perform online text cleaning during multimodal model training. Evaluation on an e-commerce dataset comprising over 710k unique Amazon products shows that refining the text tokens improves the training of the paired image branch, which leads to significantly improved visual search performance.
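
To make the idea of pruning text tokens as a form of cleaning more concrete, here is a minimal sketch. The abstract does not say how MML-TP scores or selects tokens, so everything below is an assumption for illustration only: the function names `cosine` and `prune_tokens`, the `keep_ratio` parameter, the toy 2-D embeddings, and the choice of cosine similarity to a single image embedding as the pruning criterion. It should not be read as the authors' method.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def prune_tokens(tokens, token_embs, image_emb, keep_ratio=0.5):
    """Hypothetical pruning rule: keep the tokens whose embeddings are most
    similar to the image embedding, preserving their original order."""
    if len(tokens) != len(token_embs):
        raise ValueError("tokens and token_embs must have the same length")
    if not tokens:
        return []
    n_keep = max(1, math.ceil(len(tokens) * keep_ratio))
    scores = [cosine(emb, image_emb) for emb in token_embs]
    # Rank token positions by score, highest first, then restore reading order.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:n_keep])
    return [tokens[i] for i in kept]


# Toy example (made-up embeddings): the first axis stands in for "visual" content.
title = ["red", "cotton", "shirt", "free", "shipping"]
embs = [[0.9, 0.1], [0.7, 0.3], [1.0, 0.0], [0.0, 1.0], [0.1, 0.9]]
image = [1.0, 0.0]

print(prune_tokens(title, embs, image, keep_ratio=0.5))
# prints ['red', 'cotton', 'shirt']
```

In this toy setup, the tokens "free" and "shipping" score low against the image and are dropped, which mirrors the kind of non-visual text the abstract describes. In an actual training loop the scores would come from the model itself and would change as training proceeds, which is what makes the cleaning "online".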
