Amazon Nova Multimodal Embeddings: Technical report and model card

We present Amazon Nova Multimodal Embeddings (MME), a state-of-the-art multimodal embedding model for agentic RAG and semantic search applications. Nova MME is the first embedding model to accept five input modalities, text, documents, images, video, and audio, and map them all into a single, unified embedding space, enabling cross-modal retrieval: users can search for and find relevant information across different types of data. The model converts content in any of these modalities into numerical representations, known as embeddings, that capture the semantic meaning of the underlying content, making it possible to compare, search, and perform reasoning tasks across modalities. By calculating the distance between embeddings, customers can power a wide range of intelligent applications, from semantic search and retrieval-augmented generation (RAG) with large language models (LLMs) to content classification and beyond. Supporting all five modalities through a single model gives developers the flexibility and performance needed to build next-generation AI solutions.
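To make the retrieval mechanic concrete, the sketch below ranks a mixed-modality corpus against a text query by cosine similarity between embeddings. The vectors, the dimensionality (1024), and the file names are hypothetical placeholders for illustration only; in a real application, each embedding would be produced by invoking the Nova MME model on the corresponding text, document, image, video, or audio input.

```python
# Minimal sketch of cross-modal retrieval over a unified embedding space.
# All embeddings here are random placeholders standing in for model output.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity between two embeddings; higher means semantically closer."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)

# Hypothetical embedding of a text query, e.g. "sunset over a harbor".
query_embedding = rng.normal(size=1024)

# Hypothetical embeddings of items from different modalities, all assumed
# to live in the same unified embedding space.
corpus = {
    "image_001.jpg": rng.normal(size=1024),
    "clip_017.mp4": rng.normal(size=1024),
    "report_2024.pdf": rng.normal(size=1024),
}

# Rank corpus items by similarity to the query, regardless of modality.
ranked = sorted(
    corpus.items(),
    key=lambda kv: cosine_similarity(query_embedding, kv[1]),
    reverse=True,
)
for name, emb in ranked:
    print(f"{name}: {cosine_similarity(query_embedding, emb):.3f}")
```

Because every modality maps into the same space, the same distance computation serves text-to-image, text-to-video, and any other cross-modal pairing without modality-specific retrieval logic.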
This report was published on October 28, 2025.