The science behind Amazon's new StyleSnap for Home feature
The StyleSnap for Fashion and StyleSnap for Home features are made possible by multiple convolutional neural networks.
When people shop for fashion or home décor items, they usually have an idea of what they want to purchase. However, when they sit down at a computer to search for such items, they often don’t know the right terms to find the products they are looking for quickly. This can lead customers to spend unnecessary time scrolling through hundreds of listings for a new desk, for example, because they’re unaware that the type of desk they’re looking for is called a pedestal desk.
Amazon is addressing this challenge with StyleSnap, an AI-powered feature that helps customers use a photograph or screenshot to find products that inspire them. A customer uploads an image from social media or snaps a photo of a friend’s new dress, and within seconds StyleSnap displays similar products that they might be interested in buying.
The feature first launched in the US in 2019 with fashion as the main target, and has since expanded to Germany, Italy, France, Spain, the United Kingdom, and most recently, India.
Now, Amazon has launched StyleSnap for Home.
StyleSnap for Fashion and StyleSnap for Home are one way Amazon is providing customers with faster and easier online shopping experiences. StyleSnap for Home can help customers find furnishings to upgrade a living room space or deck out a home office. If a customer sees a desk they like on social media, instead of scrolling through hundreds of products to find something similar, they can simply take a screenshot, upload it to StyleSnap via the Amazon app on their phone, and discover a variety of similar styles. Customers can even use the Prime filter to find items that can be delivered quickly.
StyleSnap was developed with deep learning and computer vision techniques, and takes advantage of convolutional neural networks (CNNs), which were initially developed for image recognition. Both the fashion and home versions of StyleSnap rely on multiple CNNs, each assigned a specific task so that the work is spread among the networks.
“We had to choose networks that were lightweight so images could be drawn fast enough to meet our customer response time targets,” said Arnau Ramisa, a senior applied scientist within Amazon’s Visual Search and Augmented Reality (VS&AR) team. “We use a system of networks focused on detection and classification, and then another network with a similar architecture, but that is slightly bigger, for comparing the customer and catalog product images.”
These CNNs were trained with hundreds of thousands of annotated images so that StyleSnap can analyze customer photographs and return results with suggestions similar to the photos. But the objective isn’t just to display items that look alike. The StyleSnap feature is designed to help people shop better, and that means providing them with the best results possible.
“If you read about advances in artificial intelligence, you might hear from some of the more technical publications that the recognition rate on images is over 95 percent,” said Doug Gray, senior applied science manager with the VS&AR team. “But then turning the latest research and machine learning models into actual products, which work on something like the entire Amazon catalog, is very challenging.”
Amazon scientists also considered a few variables for producing pleasing results: duplicate elimination, ratings and reviews, and similarity.
“Amazon has an extensive catalog of items, and with that comes the possibility for product duplicates and other challenges,” said Mengjiao Wang, an applied scientist with the VS&AR team.
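To make those three variables concrete, here is a minimal sketch of how a post-processing step could combine them. It is an illustration under stated assumptions, not Amazon's method: the field names (`listing_id`, `product_key`, `similarity`, `rating`), the `rating_weight` value, and the linear scoring formula are all hypothetical.

```python
def postprocess(candidates, rating_weight=0.2):
    """Drop duplicate products, then rank by similarity blended with rating.

    Assumption: listings that share a product_key are duplicates of one product.
    """
    best_per_product = {}
    for item in candidates:
        key = item["product_key"]
        current = best_per_product.get(key)
        # Keep only the most visually similar listing for each product.
        if current is None or item["similarity"] > current["similarity"]:
            best_per_product[key] = item

    def score(item):
        # Hypothetical blend: mostly visual similarity, nudged by the star rating (0 to 5).
        return (1 - rating_weight) * item["similarity"] + rating_weight * item["rating"] / 5.0

    return sorted(best_per_product.values(), key=score, reverse=True)


# Made-up candidates for a desk query; A1 and A2 are duplicate listings.
candidates = [
    {"listing_id": "A1", "product_key": "desk-123", "similarity": 0.95, "rating": 4.0},
    {"listing_id": "A2", "product_key": "desk-123", "similarity": 0.93, "rating": 4.0},
    {"listing_id": "B", "product_key": "desk-456", "similarity": 0.94, "rating": 4.8},
    {"listing_id": "C", "product_key": "desk-789", "similarity": 0.70, "rating": 5.0},
]
ranked = postprocess(candidates)
ranked_ids = [item["listing_id"] for item in ranked]
```

In this toy example the duplicate listing A2 is dropped, and the slightly less similar but better-rated listing B moves ahead of A1.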
After StyleSnap displays the results, customers can use typical filters such as Prime eligibility, price, and size.
Behind the scenes
When a customer uploads a photo of a home office chair they like, StyleSnap analyzes the photo. The CNN detects features within the image and translates them into a series of numerical representations of the image’s attributes. The CNN can then find the furniture objects in the image and classify them into increasingly specific categories, such as home office and then chair.
The relevant objects found in the image are then fed into another CNN, which transforms them into a vector representation, and finds products within the Amazon catalog with similar vector representations. These similar products are displayed to the customer, after undergoing some post-processing steps to ensure quality.
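The matching step described above can be pictured as a nearest-neighbor search over vectors. The sketch below is a minimal illustration, not Amazon's implementation: the three-dimensional vectors, the item names, and the choice of cosine similarity are assumptions made for the example, since the article does not say which similarity measure or search index StyleSnap uses.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Treat a zero vector as similar to nothing.
    return dot / (norm_a * norm_b)


def most_similar(query_vec, catalog, k=3):
    """Return the k catalog item IDs whose vectors are closest to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), item_id)
              for item_id, vec in catalog.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item_id for _, item_id in scored[:k]]


# Hypothetical embeddings; real ones would come from a CNN and be far longer.
catalog = {
    "pedestal_desk": [0.9, 0.1, 0.0],
    "writing_desk": [0.8, 0.3, 0.1],
    "office_chair": [0.1, 0.9, 0.2],
    "floor_lamp": [0.0, 0.2, 0.9],
}
query = [0.85, 0.15, 0.05]  # Stand-in for the vector of a customer's desk photo.
top = most_similar(query, catalog, k=2)
```

A brute-force scan like this only works for a toy catalog. At the scale of a real product catalog, an approximate nearest-neighbor index would normally take its place.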
During development, both the fashion and home versions of StyleSnap had to overcome some of the same challenges. The first of those was customer images that had occluded items, different perspectives, or noisy backgrounds. Because the primary source of training data for the deep learning models was initially the Amazon product catalog, the models tended not to work as well on the imperfect snapshots that customers might take.
“There was a domain gap between in-the-wild, and in-the-catalog images,” said Amit Kumar K C, senior applied scientist. “The question we asked was, ‘How can we generate more of these in-the-wild images?’”
To simulate a more realistic customer backdrop, an object — a shoe, for example — was automatically segmented and pasted on different backgrounds, from a room to a street scene. By capturing an object in as many settings as possible, Kumar said, they were able to bridge the domain gap and improve performance.
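The cut-and-paste idea can be sketched in a few lines. The example below is a simplified illustration, not the team's pipeline: it assumes the object has already been segmented into a pixel grid plus a binary mask, represents images as small nested lists of grayscale values, and uses hypothetical function names (`paste_object`, `augment`).

```python
import random


def paste_object(background, obj, mask, top, left):
    """Return a copy of background with obj's masked pixels pasted at (top, left)."""
    out = [row[:] for row in background]  # Copy so the original stays untouched.
    for r, (obj_row, mask_row) in enumerate(zip(obj, mask)):
        for c, (pixel, keep) in enumerate(zip(obj_row, mask_row)):
            if keep:  # Only pixels inside the segmentation mask are pasted.
                out[top + r][left + c] = pixel
    return out


def augment(obj, mask, backgrounds, rng):
    """Composite one segmented object onto each background at a random position."""
    height, width = len(obj), len(obj[0])
    samples = []
    for bg in backgrounds:
        top = rng.randint(0, len(bg) - height)
        left = rng.randint(0, len(bg[0]) - width)
        samples.append(paste_object(bg, obj, mask, top, left))
    return samples


# Toy data: a 2x2 "shoe" with value 9, where the mask excludes one corner pixel.
shoe = [[9, 9], [9, 9]]
shoe_mask = [[1, 1], [0, 1]]
room = [[0] * 4 for _ in range(4)]    # Stand-in for a room scene.
street = [[5] * 4 for _ in range(4)]  # Stand-in for a street scene.

samples = augment(shoe, shoe_mask, [room, street], random.Random(0))
pasted_counts = [sum(row.count(9) for row in img) for img in samples]
```

Each generated sample keeps the object's pixels but changes its surroundings, which is how such composites can stand in for in-the-wild photos during training.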
The fashion and home versions of StyleSnap also each had their own difficulties during development. For fashion, deformation and fabric texture presented challenges, while for home products the challenges were item similarity and context.
An interesting synergy came from the augmented reality (AR) group within the same team, which developed an AR feature that allows customers to visualize how a product looks in their home before they buy it — powered by 3D product models. To help StyleSnap overcome potential snags like angle variation, the team used those 3D models, rendered onto a variety of backgrounds, as part of their training data.
“While 3D synthetic data is not a substitute for real images, it is much easier to generate,” said Gray. “We can show the network a million variations of viewpoint and lighting, which can help the networks fill in the gaps in the data captured from cameras.”
StyleSnap for Fashion expands to India
Neural networks can generally only identify classes of items that they’ve been trained to recognize — and StyleSnap was originally built in the US, using Western clothing items. As Amazon launched the feature in new countries such as France, the United Kingdom, and Spain, the team had to tweak its models to cater to different customers. But launching StyleSnap Fashion in India presented unique challenges.
When looking at two images, one of a woman wearing a kurti, or a traditional Indian tunic, and one of a woman wearing a Western tunic, humans can immediately recognize the difference by relying on contextual cues such as color, pattern, and accessories.
But that’s a difficult challenge for computers. To make sure StyleSnap returns satisfying results for customers in both the US and India, the team developed mechanisms to ensure high-quality matches. If someone in India uploads a photo of a kurti, they get kurti results back, and if someone in the US uploads a photo of a tunic, they’ll receive tunics in their results. This applies to all the cultural clothing in India, including saris, dhotis, and lungis.
There is also extra logic in place for StyleSnap Fashion in India to make sure that pictures of outfits will trigger culturally relevant accessories, instead of Western options.