CMA-CLIP: Cross-modality attention clip for text-image classification

Jinmiao Fu; Shaoyuan Xu; Huidong Liu; Yang Liu; Ning Xie; Chien-Chih Wang; Jia Liu; Yi Sun; Bryan Wang

2022

Multi-modal learning with both text and images benefits multiple applications, such as attribute extraction for e-commerce products. In this paper, we propose Cross-Modality Attention Contrastive Language-Image Pre-training (CMA-CLIP), a new multi-modal architecture that jointly learns fine-grained inter-modality relationships. It combines CLIP with a sequence-wise attention module and a modality-wise attention module. The network uses CLIP to bridge the inter-modality gap at the global level and the sequence-wise attention module to capture the fine-grained alignment between text and images. In addition, the modality-wise attention module learns the relevance of each modality to the downstream task, making the network robust against irrelevant modalities. CMA-CLIP outperforms the state-of-the-art method on Fashion-Gen by 5.5% in accuracy, achieves competitive performance on Food101, and performs on par with the state-of-the-art method on MM-IMDb. We also demonstrate CMA-CLIP’s robustness against irrelevant modalities on an Amazon dataset for the task of product attribute extraction.
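
To make the idea of modality-wise attention concrete, the following is a minimal sketch, not the paper's implementation. It assumes each modality has already been reduced to a single embedding vector and that relevance is scored by a scaled dot product with a learned query vector. The names `modality_attention` and `query_vec` and this scoring rule are illustrative assumptions; the abstract does not specify how the module is parameterized.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def modality_attention(text_vec, image_vec, query_vec):
    """Fuse a text and an image embedding using softmax relevance weights.

    query_vec is a hypothetical learned vector (an assumption of this
    sketch); all three vectors must have the same length.
    """
    modalities = [text_vec, image_vec]
    scale = math.sqrt(len(query_vec))
    # One relevance score per modality: scaled dot product with the query.
    scores = [
        sum(q * x for q, x in zip(query_vec, vec)) / scale
        for vec in modalities
    ]
    weights = softmax(scores)
    # Fused representation: relevance-weighted sum of the two embeddings.
    fused = [
        sum(w * vec[i] for w, vec in zip(weights, modalities))
        for i in range(len(query_vec))
    ]
    return fused, weights


# Toy inputs: the query aligns with the text embedding, so text gets more weight.
fused, weights = modality_attention([1.0, 0.0], [0.0, 1.0], [4.0, 0.0])
```

In this toy setup, a modality whose embedding does not align with the query receives a small weight and contributes little to the fused vector, which is the intuition behind robustness to an irrelevant modality.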
