Interacting with voice assistants, such as Amazon Alexa, to aid in day-to-day tasks has become a ubiquitous phenomenon in modern-day households. These voice assistants often have screens to provide visual content (e.g., images, videos) to their users. There is an increasing trend of users shopping or searching for products using these devices, yet these voice assistants do not support commands or queries that contain visual references to the content shown on screen (e.g., “blue one”, “red dress”). We introduce a novel multimodal visual shopping experience in which the voice assistant is aware of the visual content shown on the screen and assists the user in item selection through natural language multimodal interactions. We detail a practical, lightweight end-to-end system architecture spanning model fine-tuning, deployment, and skill invocation on an Amazon Echo family device with a screen. We also define a niche “Visual Item Selection” task and evaluate whether we can effectively leverage publicly available multimodal models, and the embeddings they produce, for this task. We show that open-source contrastive embeddings such as CLIP [30] and ALBEF [24] achieve zero-shot accuracy above 70% on the “Visual Item Selection” task over an internally collected visual shopping dataset. By fine-tuning the embeddings, we obtain further relative accuracy gains of 8.6% to 24.0% over a baseline. The technology that enables our visual shopping assistant is available as an Alexa Skill in the Alexa Skills store.
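To make the zero-shot setup concrete, the sketch below illustrates how contrastive image-text embeddings could be used to resolve a visual reference against the items currently shown on screen. This is a minimal illustration, not the paper's implementation: it assumes the Hugging Face `transformers` CLIP interface with the `openai/clip-vit-base-patch32` checkpoint, and the item images and query are hypothetical.

```python
# Hypothetical zero-shot "Visual Item Selection" with CLIP embeddings.
# The model checkpoint, item images, and query are assumptions for illustration.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def select_item(query: str, item_images: list[Image.Image]) -> int:
    """Return the index of the on-screen item that best matches the query."""
    inputs = processor(text=[query], images=item_images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        text_emb = model.get_text_features(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"])
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    # Cosine similarity between the query embedding and each item image embedding.
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    scores = image_emb @ text_emb.T          # shape: (num_items, 1)
    return int(scores.argmax().item())

# Example usage with images of the items currently rendered on screen:
# items = [Image.open(p) for p in ("item0.jpg", "item1.jpg", "item2.jpg")]
# print(select_item("the blue dress", items))
```

The same selection-by-similarity scheme applies to other contrastive encoders (e.g., ALBEF) by swapping the embedding model; fine-tuning, as reported in the abstract, would adapt these embeddings to the shopping domain before the similarity step.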