Amazon Web Services releases two new Titan vision-language models
Novel architectures and carefully prepared training data enable state-of-the-art performance.
Last month, at its annual re:Invent developers’ conference, Amazon Web Services (AWS) announced the release of two new additions to its Titan family of foundation models, both of which translate between text and images.
With Amazon Titan Multimodal Embeddings, now available through Amazon Bedrock, customers can upload their own sets of images and then search them using text, related images, or both. The data representations generated by the model can also be used as inputs for downstream machine learning tasks.
The Amazon Titan Image Generator, which is in preview, is a generative-AI model, trained on photographs and captions and able to produce photorealistic images. Again, it can take either text or images as input, generating a set of corresponding output images.
The models have different architectures and were trained separately, but they do share one component: the text encoder.
The embedding model has two encoders, a text encoder and an image encoder, which produce vector representations — embeddings — of their respective inputs in a shared multidimensional space. The model is trained through contrastive learning: it’s fed both positive pairs (images and their true captions) and negative pairs (images and captions randomly sampled from other images), and it learns to push the embeddings of the negative examples apart and pull the embeddings of the positive pairs together.
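The training objective described above can be sketched as a CLIP-style symmetric contrastive loss. The function below is illustrative, not Amazon's actual implementation; the temperature value and batch structure are assumptions:

```python
import numpy as np

def contrastive_loss(text_emb, image_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of (text, image) pairs.

    Row i of each array is a positive pair; every other pairing in the
    batch serves as a negative. Minimizing this loss pulls positive
    pairs together and pushes negatives apart in the shared space.
    """
    # Normalize so that dot products are cosine similarities.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    logits = t @ v.T / temperature  # entry (i, j): text i vs. image j
    n = len(logits)

    def cross_entropy(m):
        # Softmax cross-entropy with the true matches on the diagonal.
        m = m - m.max(axis=1, keepdims=True)
        log_probs = m - np.log(np.exp(m).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # Average the text-to-image and image-to-text directions.
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2
```

Correctly matched batches should score a lower loss than mismatched ones, which is what drives the embeddings of true image-caption pairs together.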
The image generator uses two copies of the embedding model’s text encoder. One copy feeds the text embedding directly to an image generation module. The second copy feeds its embedding to a separately trained module that attempts to predict the corresponding image embedding. The predicted image embedding also passes to the image generation model.
The image generated by the first module then passes to a second image generation module, which also receives the input-text embedding. This second module “super-resolves” the output of the first, increasing its resolution and, as Amazon researchers’ experiments show, improving the alignment between text and image.
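The dataflow just described might be sketched as follows. Every module here is a hypothetical stand-in, since AWS has not published the components' interfaces:

```python
def generate_image(prompt, text_encoder, prior, base_generator, super_resolver):
    """Hypothetical sketch of the two-stage pipeline described above.

    All four callables are illustrative stand-ins, not AWS APIs:
    the text encoder is shared with the embedding model, the prior
    predicts an image embedding from the text embedding, and the
    second generator super-resolves the first one's output.
    """
    text_emb = text_encoder(prompt)
    pred_image_emb = prior(text_emb)           # predicted image embedding
    low_res = base_generator(text_emb, pred_image_emb)
    # Stage two also conditions on the text embedding, which the
    # researchers found improves text-image alignment.
    return super_resolver(low_res, text_emb)
```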
Beyond the models’ architecture, one of the keys to their state-of-the-art performance is the careful preparation of their training data. The first stage in the process was de-duplication, a bigger concern than it might seem. Many data sources use default images to accompany content that otherwise has no images, and these default images can be dramatically overrepresented in training data. A model that devotes too much of its capacity to a handful of default images won’t generalize well to new images.
One way to identify duplicates would be to embed all the images in the dataset and measure their distances from each other in the embedding space. But checking every image against every other scales quadratically with the size of the dataset, which would be enormously time consuming. Amazon scientists found that instead using perceptual hashing, which produces similar digital signatures for similar images, enabled effective and efficient de-duplication.
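As a rough illustration of the idea (the article doesn't specify which perceptual hash Amazon used), here is a toy average-hash. Near-duplicate images yield nearly identical bit signatures, so candidates can be matched by Hamming distance instead of all-pairs embedding comparisons:

```python
def average_hash(pixels, hash_size=8):
    """Toy perceptual hash of a grayscale image (a 2-D list of 0-255
    values, at least hash_size pixels on each side).

    Downsamples to hash_size x hash_size by block averaging, then sets
    each bit to whether its block is brighter than the overall mean.
    """
    h, w = len(pixels), len(pixels[0])
    blocks = []
    for by in range(hash_size):
        for bx in range(hash_size):
            ys = range(by * h // hash_size, (by + 1) * h // hash_size)
            xs = range(bx * w // hash_size, (bx + 1) * w // hash_size)
            vals = [pixels[y][x] for y in ys for x in xs]
            blocks.append(sum(vals) / len(vals))
    mean = sum(blocks) / len(blocks)
    return [int(b > mean) for b in blocks]

def hamming(h1, h2):
    """Bit-difference count; near-duplicates have small distances."""
    return sum(a != b for a, b in zip(h1, h2))
```

Because the hash ignores small brightness shifts and fine detail, a slightly edited copy of an image hashes to (nearly) the same signature, making exact or bucketed lookup of duplicates cheap.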
To ensure that only high-quality images were used to train the models, the Amazon scientists relied on a separate machine learning model, an image-quality classifier trained to emulate human aesthetic judgments. Only those images whose image-quality score was above some threshold were used to train the Titan models.
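The thresholding step might look like the following sketch; the scorer and threshold value are hypothetical stand-ins for the trained aesthetic classifier:

```python
def filter_by_quality(examples, quality_scorer, threshold=0.5):
    """Keep only (image, caption) pairs whose predicted aesthetic
    score clears the threshold.

    quality_scorer stands in for the image-quality classifier trained
    to emulate human judgments; the threshold value is illustrative.
    """
    return [(image, caption) for image, caption in examples
            if quality_scorer(image) >= threshold]
```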
That helped with the problem of image quality, but there was still the question of image-caption alignment. Even high-quality, professionally written image captions don’t always describe image contents, which is the information a vision-language model needs. So the Amazon scientists also built a caption generator, trained on images with descriptive captions.
During each training epoch, a small fraction of images fed to the Titan models would be recaptioned with captions produced by the generator. If the original captions described the image contents well, replacing them for one epoch would make little difference; but if they didn’t, the substitution would give the model valuable information that it wouldn’t otherwise have.
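A minimal sketch of per-epoch recaptioning, assuming a `caption_generator` callable and an illustrative replacement fraction (the article doesn't give the actual fraction):

```python
import random

def recaption_for_epoch(examples, caption_generator, fraction=0.05, seed=None):
    """Replace the captions of a small random fraction of
    (image, caption) pairs with synthetic descriptive captions.

    Called once per epoch, so different examples are recaptioned each
    time; fraction and caption_generator are hypothetical stand-ins.
    """
    rng = random.Random(seed)
    out = []
    for image, caption in examples:
        if rng.random() < fraction:
            caption = caption_generator(image)
        out.append((image, caption))
    return out
```

Because the substitution lasts only one epoch, well-captioned examples are barely perturbed, while poorly captioned ones intermittently receive descriptive captions the model wouldn't otherwise see.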
The data and captions were also carefully curated to reduce the risk of generating inappropriate or offensive images. Generated images also include invisible digital watermarks that identify them as synthetic content.
After pretraining on the cleaned dataset, the image generation model was further fine-tuned on a small set of very high-quality images with very descriptive captions, selected to cover a diverse set of image classes. The Amazon researchers’ ablation studies show that this fine-tuning significantly improved image-text alignment and reduced the likelihood of unwanted image artifacts, such as deformations of familiar objects.
In ongoing work, Amazon scientists aim to increase the resolution of the generated images still further.