RMIR: A benchmark dataset for reasoning-intensive multimodal image retrieval
2026
Current multimodal image retrieval benchmarks focus on relatively simple queries where target images are either described directly or obtained by simple composition with an input image. When retrieval requires complex reasoning to determine the target image, the task becomes significantly more challenging, yet standardized benchmarks for this setting do not exist. To fill this gap, we introduce RMIR, a benchmark dataset of 1,634 queries requiring reasoning across three categories: functional (object affordances), temporal (time-based relationships), and causal (cause-effect reasoning). Each query combines visual and textual inputs that demand robust visual understanding together with logical inference, beyond surface-level matching, to identify the correct target images. In addition to the dataset itself, we present the pipeline used to construct it, which can produce additional reasoning-intensive retrieval data at scale. Evaluation of state-of-the-art models on RMIR reveals significant performance gaps, with the best model achieving only 46.53% recall@20 averaged across reasoning categories. Our evaluation also shows that generative embedding models with explicit reasoning substantially outperform discriminative approaches, with reasoning-aware training proving more impactful than model scale. Our systematic analysis exposes fundamental limitations in current multimodal retrieval systems and establishes RMIR as a challenging testbed for developing multimodal, reasoning-capable retrieval models. Our dataset and code are available at https://github.com/amazon-science/rmir.
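For readers unfamiliar with the headline metric, the sketch below shows one way to compute recall@k macro-averaged across the three reasoning categories. It is a minimal illustration, not the released evaluation code: the query fields (`category`, `target_id`) and the `retrieve` callable are assumptions made for the example.

```python
from collections import defaultdict

def recall_at_k(ranked_ids, target_id, k=20):
    """1.0 if the correct target image appears in the top-k results, else 0.0."""
    return float(target_id in ranked_ids[:k])

def macro_avg_recall(queries, retrieve, k=20):
    """Average recall@k within each reasoning category, then average across categories.

    `queries`: iterable of dicts with illustrative fields
        - "category": one of "functional", "temporal", "causal"
        - "target_id": identifier of the correct target image
        - whatever inputs the retriever needs (image + text query)
    `retrieve`: maps a query to a ranked list of candidate image ids.
    """
    per_category = defaultdict(list)
    for q in queries:
        ranked = retrieve(q)
        per_category[q["category"]].append(recall_at_k(ranked, q["target_id"], k))

    category_scores = {c: sum(v) / len(v) for c, v in per_category.items()}
    overall = sum(category_scores.values()) / len(category_scores)
    return overall, category_scores
```

Macro-averaging over categories (rather than over all queries) keeps the headline number from being dominated by whichever reasoning category happens to contain the most queries.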