CMA-CLIP: Cross-Modality Attention CLIP for Text-Image Classification
Multi-modal learning with both text and images benefits many applications, such as attribute extraction for e-commerce products. In this paper, we propose Cross-Modality Attention Contrastive Language-Image Pre-training (CMA-CLIP), a new multi-modal architecture that jointly learns fine-grained inter-modality relationships. It fuses CLIP with a sequence-wise attention module and a modality-wise attention module. The network uses CLIP to bridge the inter-modality gap at the global level and the sequence-wise attention module to capture the fine-grained alignment between text and images. In addition, it leverages the modality-wise attention module to learn the relevance of each modality to the downstream task, making the network robust against irrelevant modalities. CMA-CLIP outperforms the state-of-the-art method on Fashion-Gen by 5.5% in accuracy, achieves competitive performance on Food101, and performs on par with the state-of-the-art method on MM-IMDb. We also demonstrate CMA-CLIP's robustness against irrelevant modalities on an Amazon dataset for the task of product attribute extraction.
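The modality-wise attention described above can be pictured as a learned weighting over per-modality embeddings, so that an irrelevant modality receives a small weight before fusion. The sketch below is a minimal illustrative rendering of that idea, not the paper's actual implementation; the relevance vector `w` and function names are assumptions introduced for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def modality_wise_attention(text_emb, image_emb, w):
    """Fuse two modality embeddings with learned relevance weights.

    text_emb, image_emb: (d,) global embeddings (e.g. from CLIP encoders)
    w: (d,) learned relevance vector (illustrative assumption)
    """
    feats = np.stack([text_emb, image_emb])   # (2, d): one row per modality
    scores = feats @ w                         # (2,): relevance score per modality
    alphas = softmax(scores)                   # modality weights, sum to 1
    fused = alphas @ feats                     # (d,): weighted sum of modalities
    return fused, alphas

# Toy usage: a text embedding that matches w gets a higher weight.
d = 4
text_emb = np.ones(d)
image_emb = np.zeros(d)
w = np.ones(d)
fused, alphas = modality_wise_attention(text_emb, image_emb, w)
```

In a trained network, `w` (or an equivalent scoring sub-network) would be learned end-to-end with the downstream classification loss, so modalities irrelevant to the task are down-weighted automatically.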