RMIR: A benchmark dataset for reasoning-intensive multimodal image retrieval
2026
Current multimodal image retrieval benchmarks focus on relatively simple queries where target images are either described directly or obtained by simple composition with an input image. When retrieval requires complex reasoning to determine the target image, the task becomes significantly more challenging, yet standardized benchmarks for this setting do not exist. To fill this gap, we introduce RMIR, a benchmark dataset of 1,634 queries requiring reasoning across three categories: functional (object affordances), temporal (time-based relationships), and causal (cause-effect reasoning). Each query combines visual and textual inputs that demand robust visual understanding together with logical inference, beyond surface-level matching, to identify the correct target images. In addition to the dataset itself, we present the pipeline used to construct it, which can produce additional reasoning-intensive retrieval data at scale. Evaluation of state-of-the-art models on RMIR reveals significant performance gaps, with the best model achieving only 46.53% recall@20 averaged across reasoning categories. Our evaluation also shows that generative embedding models with explicit reasoning substantially outperform discriminative approaches, with reasoning-aware training proving more impactful than model scale. Our systematic analysis exposes fundamental limitations in current multimodal retrieval systems and establishes RMIR as a challenging testbed for developing multimodal, reasoning-capable retrieval models. Our dataset and code are available at https://github.com/amazon-science/rmir.
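For readers unfamiliar with the headline metric, the sketch below shows one way to compute recall@k macro-averaged across the three reasoning categories. It is a minimal illustration, not the released evaluation code: the query fields (`category`, `target_id`) and the `retrieve` callable are assumptions made for the example.

```python
from collections import defaultdict

def recall_at_k(ranked_ids, target_id, k=20):
    """1.0 if the correct target image appears in the top-k results, else 0.0."""
    return float(target_id in ranked_ids[:k])

def macro_avg_recall(queries, retrieve, k=20):
    """Average recall@k within each reasoning category, then average across categories.

    `queries`: iterable of dicts with illustrative fields
        - "category": one of "functional", "temporal", "causal"
        - "target_id": identifier of the correct target image
        - whatever inputs the retriever needs (image + text query)
    `retrieve`: maps a query to a ranked list of candidate image ids.
    """
    per_category = defaultdict(list)
    for q in queries:
        ranked = retrieve(q)
        per_category[q["category"]].append(recall_at_k(ranked, q["target_id"], k))

    category_scores = {c: sum(v) / len(v) for c, v in per_category.items()}
    overall = sum(category_scores.values()) / len(category_scores)
    return overall, category_scores
```

Macro-averaging over categories (rather than over all queries) keeps the headline number from being dominated by whichever reasoning category happens to contain the most queries.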