Benchmarking robustness under distribution shift of multimodal image-text models
Multimodal image-text models have shown remarkable performance in the past few years. However, the robustness of such foundation models against distribution shifts is crucial for downstream applications. In this paper, we investigate their robustness under image and text perturbations. We first build several multimodal benchmark datasets by applying 17 image perturbation techniques and 16 text perturbation techniques. We then extensively study the robustness of 6 widely adopted models on 3 downstream tasks (image-text retrieval, visual reasoning, and visual entailment). We observe that these powerful multimodal models are sensitive to image and text perturbations, especially to image perturbations. For text, character-level perturbations show a higher adversarial impact than word-level and sentence-level perturbations. We also observe that models trained with generative objectives tend to be more robust. Our robustness findings could facilitate the development of large image-text models, as well as their deployment in real-world applications.
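As an illustrative sketch only (not the paper's actual benchmark code), a character-level text perturbation of the kind studied here could be implemented as a random adjacent-character swap; the function name and swap rate below are assumptions for demonstration:

```python
import random


def char_swap(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Character-level perturbation: randomly swap adjacent alphabetic
    characters with probability `rate`, using a fixed seed for
    reproducibility. Length and non-alphabetic characters are preserved."""
    rng = random.Random(seed)
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


# Example: perturb a caption before feeding it to an image-text model.
print(char_swap("a dog is running in the park", rate=0.3))
```

Feeding such lightly corrupted captions to a retrieval or reasoning model, and measuring the drop in task accuracy relative to clean inputs, is the general evaluation pattern the benchmark follows.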