For most of the past ten years, machine learning (ML) has relied heavily on the concept of the embedding: an ML model learns to convert input data into vectors (embeddings) such that geometric relationships within the vector space carry semantic meaning. For instance, words whose embeddings are near each other in the representational space tend to have similar meanings.
The concept of embedding implied an obvious information retrieval paradigm: a query would be embedded in the representational space, and the model would select the response whose embedding was closest to it. This worked with multimodal information retrieval, too, as text and images (or other modalities) could be embedded in the same space.
More recently, however, generative AI has come to dominate ML research, and at the 2025 Conference on Computer Vision and Pattern Recognition (CVPR), we presented a paper that updates ML-based information retrieval for the generative-AI era. Our model, dubbed GENIUS (for generative universal multimodal search), is a multimodal model whose inputs and outputs can be any combination of images, texts, or image-text pairs.

Instead of comparing a query vector to every possible response vector (a time-consuming task if the image catalogue or text corpus is large), our model takes a query as input and generates a single ID code as output. This approach has been tried before, but GENIUS dramatically improves on previous generation-based information retrieval methods. In tests on two different datasets using three metrics (Recall@1, Recall@5, and Recall@10, i.e., retrieval accuracy when one, five, or ten candidate responses are retrieved), GENIUS improves on the best-performing prior generative retrieval model by 22% to 36%.
When we then use conventional embedding-based methods to rerank the top generated response candidates, we improve performance still further, by 31% to 56%, significantly narrowing the gap between generation-based and embedding-based methods.
Paradigm shift
Information retrieval (IR) is the process of finding relevant information from a large database. With traditional embedding-based retrieval, queries and database items are both mapped into a high-dimensional space, and similarity is measured using metrics like cosine similarity. While effective, these methods face scalability issues as the database grows, due to the increasing cost of index building, maintenance, and nearest-neighbor search.
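To make that cost concrete, here is a minimal sketch of embedding-based retrieval in Python (the function and its inputs are illustrative, not from the paper): every database item is scored against the query, so the scan grows linearly with database size unless an approximate-nearest-neighbor index is built and maintained.

```python
import numpy as np

def embedding_retrieve(query_emb: np.ndarray,
                       item_embs: np.ndarray,
                       k: int = 5) -> np.ndarray:
    """Return indices of the k database items most similar to the
    query under cosine similarity."""
    # Normalize so that dot products equal cosine similarities.
    q = query_emb / np.linalg.norm(query_emb)
    items = item_embs / np.linalg.norm(item_embs, axis=1, keepdims=True)
    scores = items @ q              # one score per database item
    # Brute-force scan: linear in the number of items, which is the
    # scalability bottleneck described above.
    return np.argsort(-scores)[:k]
```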
Generative retrieval has emerged as a promising alternative. Instead of embedding items, generative models directly generate identifiers (IDs) of target data based on a query. This approach enables constant-time retrieval, regardless of database size. However, existing generative methods are often task-specific, fall short of embedding-based methods in accuracy, and struggle with multimodal data.
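By contrast, a generative retriever resolves a query by decoding an ID and looking it up directly. A hedged sketch, assuming a hypothetical model object with a generate_ids method that beam-decodes discrete code sequences (neither the method name nor the codes below come from the paper):

```python
# Hypothetical index: a hash map from ID code sequences to items.
id_to_item = {
    (0, 12, 7, 3): "image_04321.jpg",    # codes are illustrative
    (1, 5, 22, 9): "caption_00017.txt",
    # ... one entry per database item
}

def generative_retrieve(query, model, beam_size: int = 5):
    """Decode candidate ID sequences for the query, then resolve them
    by dictionary lookup; no scan over item embeddings, so the cost
    does not grow with database size."""
    candidate_ids = model.generate_ids(query, num_beams=beam_size)
    return [id_to_item[c] for c in candidate_ids if c in id_to_item]
```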
GENIUS
Unlike prior approaches that are limited to single-modality tasks or specific benchmarks, GENIUS generalizes across retrieval of texts, images, and image-text pairs, maintaining high speed and competitive accuracy. Its advantages over prior generation-based models stem from two key innovations:
Semantic quantization: During training, the model's target output IDs are generated through residual quantization. Each ID is actually a sequence of codes, the first of which specifies the data item's modality: image, text, or image-text pair. The successive codes pin down the data item's region of the representational space with increasing specificity: items that share the first code are in the same general area; items that share the first two codes are clustered more tightly within that area; items that share the first three codes are clustered more tightly still; and so on. The model then learns to generate this sequence of codes directly from the encoding of the input query.
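As an illustration of this coarse-to-fine coding, here is a sketch of residual quantization with made-up codebooks (the paper's actual codebook sizes and training procedure are not shown); each level quantizes the residual left by the previous level, and the modality code is simply prepended:

```python
import numpy as np

def residual_quantize(embedding: np.ndarray,
                      codebooks: list[np.ndarray],
                      modality: int) -> tuple[int, ...]:
    """Map an embedding to a coarse-to-fine sequence of codes.

    codebooks[l] has shape (num_codes, dim). Each level picks the
    codeword nearest the current residual and subtracts it, so later
    codes refine the region selected by earlier ones.
    """
    codes = [modality]                # first code: image, text, or pair
    residual = embedding.copy()
    for codebook in codebooks:
        dists = np.linalg.norm(codebook - residual, axis=1)
        idx = int(np.argmin(dists))   # nearest codeword at this level
        codes.append(idx)
        residual = residual - codebook[idx]
    return tuple(codes)
```

Items whose code sequences share a long prefix lie close together in the representational space, which is what lets a decoder generate an ID one code at a time, coarsest region first.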
Query augmentation: Semantic quantization alone yields a model that can generate accurate ID codes for familiar types of objects and texts, but it can struggle to generalize to new data types. To address this limitation, we use query augmentation. For a representative sample of query-ID pairs, we generate new queries by interpolating between the initial query and the target in the representational space. This way, the model learns that a variety of queries can map to the same target, which helps it generalize.
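A minimal sketch of the interpolation step, assuming access to the embeddings of a query and its target (the uniform mixing weights here are an assumption, not necessarily the paper's sampling scheme):

```python
import numpy as np

def augment_queries(query_emb: np.ndarray,
                    target_emb: np.ndarray,
                    num_augmented: int = 4) -> list[np.ndarray]:
    """Create synthetic query embeddings by interpolating between a
    query and its target. Each synthetic query keeps the original
    target ID, so the model sees many distinct queries that all map
    to the same code sequence."""
    alphas = np.random.uniform(0.0, 1.0, size=num_augmented)  # assumed scheme
    return [(1.0 - a) * query_emb + a * target_emb for a in alphas]
```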

Results
In experiments using the M-BEIR benchmark, GENIUS surpassed the best generative retrieval method by 28.6 points in Recall@5 on the COCO dataset for text-to-image retrieval. With embedding-based re-ranking, GENIUS often achieved results close to those of embedding-based baselines on the M-BEIR benchmark while preserving the efficiency benefits of generative retrieval.
GENIUS achieves state-of-the-art performance among generative methods and narrows the performance gap between generative and embedding-based methods. Its efficiency advantage becomes more significant as the dataset grows: GENIUS maintains high retrieval speed without the expensive index building typical of embedding-based methods. It thus represents a significant step forward in generative multimodal retrieval.