Amazon Web Services releases two new Titan vision-language models
Novel architectures and carefully prepared training data enable state-of-the-art performance.
Last month, at its annual re:Invent developers’ conference, Amazon Web Services (AWS) announced the release of two new additions to its Titan family of foundation models, both of which translate between text and images.
With Amazon Titan Multimodal Embeddings, now available through Amazon Bedrock, customers can upload their own sets of images and then search them using text, related images, or both. The data representations generated by the model can also be used as inputs for downstream machine learning tasks.
The Amazon Titan Image Generator, which is in preview, is a generative-AI model, trained on photographs and captions and able to produce photorealistic images. Again, it can take either text or images as input, generating a set of corresponding output images.
The models have different architectures and were trained separately, but they do share one component: the text encoder.
The embedding model has two encoders, a text encoder and an image encoder, which produce vector representations — embeddings — of their respective inputs in a shared multidimensional space. The model is trained through contrastive learning: it’s fed both positive pairs (images and their true captions) and negative pairs (images and captions randomly sampled from other images), and it learns to push the embeddings of the negative examples apart and pull the embeddings of the positive pairs together.
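The training objective described above can be sketched as a CLIP-style symmetric contrastive loss. The function below is illustrative, not Amazon's actual implementation; the temperature value and batch structure are assumptions:

```python
import numpy as np

def contrastive_loss(text_emb, image_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of (text, image) pairs.

    Row i of each array is a positive pair; every other pairing in the
    batch serves as a negative. Minimizing this loss pulls positive
    pairs together and pushes negatives apart in the shared space.
    """
    # Normalize so that dot products are cosine similarities.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    logits = t @ v.T / temperature  # entry (i, j): text i vs. image j
    n = len(logits)

    def cross_entropy(m):
        # Softmax cross-entropy with the true matches on the diagonal.
        m = m - m.max(axis=1, keepdims=True)
        log_probs = m - np.log(np.exp(m).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # Average the text-to-image and image-to-text directions.
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2
```

Correctly matched batches should score a lower loss than mismatched ones, which is what drives the embeddings of true image-caption pairs together.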
The image generator uses two copies of the embedding model’s text encoder. One copy feeds the text embedding directly to an image generation module. The second copy feeds its embedding to a separately trained module that attempts to predict the corresponding image embedding. The predicted image embedding also passes to the image generation model.
The image generated by the first module then passes to a second image generation module, which also receives the input-text embedding. This second module “super-resolves” the output of the first, increasing its resolution and, as Amazon researchers’ experiments show, improving the alignment between text and image.
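The dataflow just described might be sketched as follows. Every module here is a hypothetical stand-in, since AWS has not published the components' interfaces:

```python
def generate_image(prompt, text_encoder, prior, base_generator, super_resolver):
    """Hypothetical sketch of the two-stage pipeline described above.

    All four callables are illustrative stand-ins, not AWS APIs:
    the text encoder is shared with the embedding model, the prior
    predicts an image embedding from the text embedding, and the
    second generator super-resolves the first one's output.
    """
    text_emb = text_encoder(prompt)
    pred_image_emb = prior(text_emb)           # predicted image embedding
    low_res = base_generator(text_emb, pred_image_emb)
    # Stage two also conditions on the text embedding, which the
    # researchers found improves text-image alignment.
    return super_resolver(low_res, text_emb)
```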
Beyond the models’ architecture, one of the keys to their state-of-the-art performance is the careful preparation of their training data. The first stage in the process was de-duplication, a bigger concern than it might seem. Many data sources use default images to accompany content that otherwise has no images, and these default images can be dramatically overrepresented in training data. A model that devotes too much of its capacity to a handful of default images won’t generalize well to new images.
One way to identify duplicates would be to embed all the images in the dataset and measure their distances from each other in the embedding space. But checking every image against every other scales quadratically with the size of the dataset, which would be enormously time consuming. Amazon scientists found that instead using perceptual hashing, which produces similar digital signatures for similar images, enabled effective and efficient de-duplication.
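As a rough illustration of the idea (the article doesn't specify which perceptual hash Amazon used), here is a toy average-hash. Near-duplicate images yield nearly identical bit signatures, so candidates can be matched by Hamming distance instead of all-pairs embedding comparisons:

```python
def average_hash(pixels, hash_size=8):
    """Toy perceptual hash of a grayscale image (a 2-D list of 0-255
    values, at least hash_size pixels on each side).

    Downsamples to hash_size x hash_size by block averaging, then sets
    each bit to whether its block is brighter than the overall mean.
    """
    h, w = len(pixels), len(pixels[0])
    blocks = []
    for by in range(hash_size):
        for bx in range(hash_size):
            ys = range(by * h // hash_size, (by + 1) * h // hash_size)
            xs = range(bx * w // hash_size, (bx + 1) * w // hash_size)
            vals = [pixels[y][x] for y in ys for x in xs]
            blocks.append(sum(vals) / len(vals))
    mean = sum(blocks) / len(blocks)
    return [int(b > mean) for b in blocks]

def hamming(h1, h2):
    """Bit-difference count; near-duplicates have small distances."""
    return sum(a != b for a, b in zip(h1, h2))
```

Because the hash ignores small brightness shifts and fine detail, a slightly edited copy of an image hashes to (nearly) the same signature, making exact or bucketed lookup of duplicates cheap.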
To ensure that only high-quality images were used to train the models, the Amazon scientists relied on a separate machine learning model, an image-quality classifier trained to emulate human aesthetic judgments. Only those images whose image-quality score was above some threshold were used to train the Titan models.
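The thresholding step might look like the following sketch; the scorer and threshold value are hypothetical stand-ins for the trained aesthetic classifier:

```python
def filter_by_quality(examples, quality_scorer, threshold=0.5):
    """Keep only (image, caption) pairs whose predicted aesthetic
    score clears the threshold.

    quality_scorer stands in for the image-quality classifier trained
    to emulate human judgments; the threshold value is illustrative.
    """
    return [(image, caption) for image, caption in examples
            if quality_scorer(image) >= threshold]
```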
That helped with the problem of image quality, but there was still the question of image-caption alignment. Even high-quality, professionally written image captions don’t always describe image contents, which is the information a vision-language model needs. So the Amazon scientists also built a caption generator, trained on images with descriptive captions.
During each training epoch, a small fraction of images fed to the Titan models would be recaptioned with captions produced by the generator. If the original captions described the image contents well, replacing them for one epoch would make little difference; but if they didn’t, the substitution would give the model valuable information that it wouldn’t otherwise have.
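A minimal sketch of per-epoch recaptioning, assuming a `caption_generator` callable and an illustrative replacement fraction (the article doesn't give the actual fraction):

```python
import random

def recaption_for_epoch(examples, caption_generator, fraction=0.05, seed=None):
    """Replace the captions of a small random fraction of
    (image, caption) pairs with synthetic descriptive captions.

    Called once per epoch, so different examples are recaptioned each
    time; fraction and caption_generator are hypothetical stand-ins.
    """
    rng = random.Random(seed)
    out = []
    for image, caption in examples:
        if rng.random() < fraction:
            caption = caption_generator(image)
        out.append((image, caption))
    return out
```

Because the substitution lasts only one epoch, well-captioned examples are barely perturbed, while poorly captioned ones intermittently receive descriptive captions the model wouldn't otherwise see.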
The data and captions were also carefully curated to reduce the risk of generating inappropriate or offensive images. Generated images also include invisible digital watermarks that identify them as synthetic content.
After pretraining on the cleaned dataset, the image generation model was further fine-tuned on a small set of very high-quality images with very descriptive captions, selected to cover a diverse set of image classes. The Amazon researchers’ ablation studies show that this fine-tuning significantly improved image-text alignment and reduced the likelihood of unwanted image artifacts, such as deformations of familiar objects.
In ongoing work, Amazon scientists aim to increase the resolution of the generated images still further.