Interacting with voice assistants, such as Amazon Alexa, to aid in day-to-day tasks has become a ubiquitous phenomenon in modern-day households. These voice assistants often have screens to provide visual content (e.g., images, videos) to their users. There is an increasing trend of users shopping or searching for products using these devices, yet these voice assistants do not support commands or queries that contain visual references to the content shown on screen (e.g., “blue one”, “red dress”). We introduce a novel multimodal visual shopping experience in which the voice assistant is aware of the visual content shown on the screen and assists the user in item selection through natural language multimodal interactions. We detail a practical, lightweight end-to-end system architecture spanning model fine-tuning, deployment, and skill invocation on an Amazon Echo family device with a screen. We also define a niche “Visual Item Selection” task and evaluate whether we can effectively leverage publicly available multimodal models, and the embeddings they produce, for this task. We show that open-source contrastive embeddings such as CLIP [30] and ALBEF [24] achieve zero-shot accuracy above 70% on the “Visual Item Selection” task over an internally collected visual shopping dataset. By fine-tuning the embeddings, we obtain further relative accuracy gains of 8.6% to 24.0% over a baseline. The technology that enables our visual shopping assistant is available as an Alexa Skill in the Alexa Skills store.
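To make the zero-shot setup concrete, the sketch below illustrates how contrastive image-text embeddings could be used to resolve a visual reference against the items currently shown on screen. This is a minimal illustration, not the paper's implementation: it assumes the Hugging Face `transformers` CLIP interface with the `openai/clip-vit-base-patch32` checkpoint, and the item images and query are hypothetical.

```python
# Hypothetical zero-shot "Visual Item Selection" with CLIP embeddings.
# The model checkpoint, item images, and query are assumptions for illustration.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def select_item(query: str, item_images: list[Image.Image]) -> int:
    """Return the index of the on-screen item that best matches the query."""
    inputs = processor(text=[query], images=item_images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        text_emb = model.get_text_features(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"])
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    # Cosine similarity between the query embedding and each item image embedding.
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    scores = image_emb @ text_emb.T          # shape: (num_items, 1)
    return int(scores.argmax().item())

# Example usage with images of the items currently rendered on screen:
# items = [Image.open(p) for p in ("item0.jpg", "item1.jpg", "item2.jpg")]
# print(select_item("the blue dress", items))
```

The same selection-by-similarity scheme applies to other contrastive encoders (e.g., ALBEF) by swapping the embedding model; fine-tuning, as reported in the abstract, would adapt these embeddings to the shopping domain before the similarity step.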