Vision-language models that can handle multi-image inputs

Attention-based representation of multi-image inputs improves performance on downstream vision-language tasks.

By Wenyi Wu, Qi Li

January 19, 2024

Vision-language models, which map images and text to a common representational space, have demonstrated remarkable performance on a wide range of multimodal AI tasks. But they’re typically trained on text-image pairs: each text input is associated with a single image.

This limits the models’ applicability. You might, for instance, want a vision-language model to take two input images and identify the differences between them, or you might want to make inferences from a 3-D fusion of ultrasound or X-ray cross sections. In the Amazon Store, multiple images are frequently associated with a single product, and you might want to execute a query that factors in several of those images.

The standard way around this limitation is to concatenate a set of images and feed them to a model as, essentially, one enormous image. But this misses an opportunity to create a richer representation — or embedding — that systematically draws on complementary information from multiple images.

Model architecture

Vision-language models typically involve an image encoder, which produces an embedding of an input image, and a projection layer, which learns to project the image embedding into the representational space of a trained large language model (LLM).

Sometimes, a query embedding generator intervenes between the image encoder and the projection layer. The query embedding generator is trained on a combination of image embeddings and the associated image captions, so it learns linguistic representations of the image embeddings that can help the projection layer better navigate the LLM’s representational space.

Figure: Two parallel image-embedding pipelines, taking as input the same image of a white sofa with pictures hanging above it. One pipeline includes a query embedding generator and one does not. Caption: In a typical vision-language model, an image embedding passes to a projection layer that projects the embedding into the representational space of a trained LLM. Sometimes, a query embedding generator intervenes between the image encoder and the projection layer.
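The two pipelines described above can be sketched in a few lines of Python. This is a toy illustration only: the encoder, the query embedding generator, and the projection layer are stand-ins built from random linear maps, and every name and dimension here (`image_encoder`, `embed_image`, `W_query`, `IMAGE_DIM`, and so on) is hypothetical rather than taken from any real model. In practice, the query embedding generator is a trained network, not a single matrix.

```python
import random

# Hypothetical sizes, chosen only to keep the example small.
IMAGE_DIM, QUERY_DIM, LLM_DIM = 8, 6, 4

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def random_matrix(rows, cols, rng):
    """Random weights standing in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
W_query = random_matrix(QUERY_DIM, IMAGE_DIM, rng)           # stand-in query embedding generator
W_proj_from_image = random_matrix(LLM_DIM, IMAGE_DIM, rng)   # projection used without the generator
W_proj_from_query = random_matrix(LLM_DIM, QUERY_DIM, rng)   # projection used with the generator

def image_encoder(pixels):
    """Toy encoder: average strided groups of pixels into IMAGE_DIM features."""
    groups = [pixels[i::IMAGE_DIM] for i in range(IMAGE_DIM)]
    return [sum(group) / len(group) for group in groups]

def embed_image(pixels, use_query_generator=False):
    """Map an image into the (toy) LLM space via one of the two pipelines."""
    embedding = image_encoder(pixels)
    if use_query_generator:
        # Pipeline 2: encoder -> query embedding generator -> projection layer.
        query_embedding = matvec(W_query, embedding)
        return matvec(W_proj_from_query, query_embedding)
    # Pipeline 1: encoder -> projection layer.
    return matvec(W_proj_from_image, embedding)

pixels = [rng.random() for _ in range(64)]  # a fake 64-pixel image
direct = embed_image(pixels)
via_query = embed_image(pixels, use_query_generator=True)
```

Both calls return a vector of length `LLM_DIM`, so whatever consumes the projected embedding does not need to know which pipeline produced it.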

We introduce a multiple-instance visual component (MIVC) that, in either architecture, receives the output of the visual encoder, creating a unified representation of multiple input images.

Figure: The two parallel pipelines from earlier, each alongside a modified version of itself in which the MIVC layer, depicted as a green bubble with an arrow extending to a green array, is added, for a total of four pipelines. Caption: Both vision-language model architectures, with and without the addition of the multiple-instance visual component (MIVC).

Permutation-invariant attention

The visual encoder learns to recognize features of the input data — which might be low-level properties like color gradients across image patches or higher-level properties like particular shapes — and assigns each input a value along each feature dimension.
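Given one such feature vector per image, an attention mechanism can combine several vectors into a single one without depending on the order of the images. The sketch below is a minimal illustration of that general idea, assuming a single learned scoring vector and softmax weights. The scoring function, the names `attention_pool` and `scoring_vector`, and the toy numbers are assumptions made for illustration; they are not a description of how MIVC itself computes attention.

```python
import math

def attention_pool(embeddings, scoring_vector):
    """Pool per-image embeddings into one vector using softmax attention weights.

    Each image gets a scalar score (dot product with scoring_vector); the
    scores are normalized with a softmax, and the pooled embedding is the
    weighted sum of the inputs. Because a sum does not depend on the order
    of its terms, the result is the same for any ordering of the images.
    """
    scores = [sum(w * x for w, x in zip(scoring_vector, e)) for e in embeddings]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(embeddings[0])
    pooled = [sum(a * e[d] for a, e in zip(weights, embeddings)) for d in range(dim)]
    return pooled, weights

# Three toy image embeddings, e.g. three photos of the same product.
image_embeddings = [
    [0.2, 0.9, 0.1],
    [0.8, 0.1, 0.3],
    [0.5, 0.5, 0.5],
]
scoring_vector = [1.0, -0.5, 0.25]  # hypothetical "learned" parameters

pooled, weights = attention_pool(image_embeddings, scoring_vector)

# Reordering the images leaves the pooled embedding unchanged
# (up to floating-point rounding).
pooled_reordered, _ = attention_pool(list(reversed(image_embeddings)), scoring_vector)
```

The pooled vector has the same dimension as each individual image embedding, so it can be passed on in place of a single-image embedding.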

Because we fine-tuned the attention mechanism on the target task, we fine-tuned the baseline model, too, to ensure fair comparison. But on the attribute inference and captioning tasks, fine-tuning actually diminished the baseline model’s performance. If we use the zero-shot concatenated-image model as the baseline, the improvements offered by our method shrink slightly: on the image-captioning task, our advantage contracts to 5.6%, and on the product attribute inference task, the advantages on precision and recall contract to 5.5% and 7%. But that’s still a significant difference.

At present, the attention mechanism applies only to the visual encoding pipeline, and it operates under the assumption that all images are independently and identically distributed. In ongoing work, we’re investigating whether cross-modal attention and incorporating correlations across images offer any further improvements.

About the Authors

Wenyi Wu

Wenyi Wu is a senior applied scientist at Amazon.

Qi Li

Qi Li is an applied scientist at Amazon.
