Amazon’s online catalogue contains hundreds of millions of products, and millions of product listings are added and edited daily. Product data — images, titles, descriptions, and usage recommendations — must be complete, accurate, and appealing so that shoppers can quickly find the products they are seeking.
To ensure the quality of product data, Amazon has traditionally relied on specialized machine learning (ML) models, each optimized for a single product category, from patio furniture to headphones. These models add or update information, identify inaccuracies, consolidate information, translate text into different languages, and incorporate data from third-party sources.
Such models work best for products with smaller, structured lists of attributes — dinner plates, for instance, which are well described by size, shape, color, and material. But the catalogue contains many products whose attributes are far more complicated or nuanced, and these require specially trained ML models or manual review.
To ensure that the quality of product listings meets shoppers’ needs, we’ve turned to more adaptable and generalizable large language models (LLMs). When prompted with attribute data from the catalogue, LLMs adapt to the catalogue’s structures and vocabulary, allowing them to be usefully integrated into the quality control process. These catalogue AI solutions are correcting and updating product attributes at the scale of the Amazon Stores.
Prompt tuning
To adapt an LLM to the challenge of catalogue quality control, we needed to expose it to “knowledge” about the product catalogue. In other words, we needed to systematically introduce it to the attribute semantics and values that most accurately describe millions of products and product types. But first we needed to build that knowledge. That process starts with summarizing and organizing the entire catalogue by product type and attribute value, similar, in some ways, to grouping the rows of a very large, very complex spreadsheet.
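To make the grouping concrete, here is a minimal sketch of that summarization step in Python, using pandas. The column names and sample rows are illustrative assumptions, not Amazon’s actual catalogue schema:

```python
import pandas as pd

# Illustrative rows only; Amazon's real catalogue schema is not shown here.
catalogue = pd.DataFrame(
    [
        ("headphones", "connectivity", "Bluetooth"),
        ("headphones", "connectivity", "BT"),
        ("headphones", "connectivity", "Bluetooth"),
        ("dinner plate", "material", "porcelain"),
    ],
    columns=["product_type", "attribute", "value"],
)

# Group by product type and attribute, then count how often each
# seller-provided value appears, much like grouping the rows of a spreadsheet.
summary = (
    catalogue.groupby(["product_type", "attribute", "value"])
    .size()
    .rename("listing_count")
    .reset_index()
)
print(summary)
```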
Through this reorganization, we can see the range of seller-provided attribute values for various product types and, importantly, statistics on how often and where those values appear. These statistics are fairly good indicators of a value’s correctness: if a large number of products in a category use a certain attribute value, for instance, or if products with a certain attribute value are more frequently viewed by customers, we can be confident that the value is correct. Wireless headphones might have a connectivity attribute whose value appears as “Bluetooth”, “BT”, “BT 5.1”, or “Bluetooth version 5.1”, but the statistics will say that “Bluetooth” is the best candidate to use to inform our LLM.
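A toy version of this scoring logic might look like the following; the value counts and customer-view numbers here are invented for illustration:

```python
from collections import Counter

# Hypothetical seller-provided values and customer-view counts for the
# connectivity attribute of wireless headphones.
observed = ["Bluetooth", "Bluetooth", "Bluetooth", "BT",
            "BT 5.1", "Bluetooth version 5.1"]
views = {"Bluetooth": 1200, "BT": 90, "BT 5.1": 40,
         "Bluetooth version 5.1": 25}

frequency = Counter(observed)

def score(value: str) -> tuple[int, int]:
    """Rank a candidate by listing frequency, then by customer views."""
    return (frequency[value], views.get(value, 0))

best_candidate = max(frequency, key=score)
print(best_candidate)  # Bluetooth
```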
While attribute statistics work well in many cases, they won’t work for all attributes, especially when there’s more nuance involved. One challenge with some attributes is their granularity, or how precisely they describe their products. An example is a surgical instrument whose material attribute might have the value “stainless steel” or “440 stainless steel”. The second is more granular; even though “stainless steel” is the more likely attribute value, we don’t want to eliminate “440 stainless steel”.
The way to keep such granularity in the catalogue is through an iterative process called prompt tuning, wherein general-purpose LLMs are exposed to the particular schemas, rules, and terms that appear in the environment where they will be used. To preserve granularity, we might prompt our LLM with the instruction “The values returned must match the granularity, or broadness, of the values in the candidate list.” We can also ask an LLM for the reasoning behind its response, since this tends to improve its performance and gives engineers insights that help them further refine their prompts.
Prompt tuning is also how we handle other nuances of product description. These include ensuring consistency of representation, such as “men’s shirt” versus “men shirt”, and maintaining meaningful value representations, such as “4K UHD HDR” for a TV, which is more informative than “4K”.
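Below is a hedged sketch of what such a prompt-tuned request might look like. Only the granularity instruction quoted above comes from our prompts; the rest of the template, and the call_llm placeholder, are illustrative assumptions:

```python
# The granularity rule is quoted from the prompt described above; the
# surrounding template and the call_llm() helper are illustrative assumptions.
PROMPT_TEMPLATE = """You are standardizing attribute values in a product catalogue.

Product type: {product_type}
Attribute: {attribute}
Candidate values: {candidates}

Rules:
- The values returned must match the granularity, or broadness, of the
  values in the candidate list.
- Use consistent surface forms (e.g., "men's shirt", not "men shirt").
- Prefer the most informative meaningful representation (e.g., "4K UHD HDR"
  for a TV rather than "4K").

Return the standardized value(s) and explain the reasoning behind your answer.
"""

prompt = PROMPT_TEMPLATE.format(
    product_type="surgical scissors",
    attribute="material",
    candidates=["stainless steel", "440 stainless steel"],
)
# response = call_llm(prompt)  # placeholder for whichever LLM client is used
```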
After many rounds of prompt tuning, the LLM is ready to be exposed to the entire catalogue, where it performs three main tasks: recognizing standard attribute values, to establish correctness; collecting alternative representations of standard values, or synonyms; and detecting erroneous or nonsensical data entries.
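The results of those three tasks can be thought of as one structured record per product type and attribute, along the following lines; the field names and sample values are assumptions for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class AttributeReview:
    """One record per (product type, attribute); field names are assumptions."""
    product_type: str
    attribute: str
    standard_value: str  # recognized standard value, establishing correctness
    synonyms: list[str] = field(default_factory=list)          # alternative representations
    erroneous_values: list[str] = field(default_factory=list)  # nonsensical entries

review = AttributeReview(
    product_type="headphones",
    attribute="connectivity",
    standard_value="Bluetooth",
    synonyms=["BT", "Bluetooth version 5.1"],
    erroneous_values=["N/A", "asdf"],
)
print(review)
```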
The new process ensures that the latest seller values are included in the catalogue more quickly (within days) and saves thousands of hours of human review. What’s more, we’ve been able to use the LLM to increase the number of languages we can monitor and update. Our LLM-based method allows us to extend the quality control process into the furthest reaches of the catalogue, which would have been cost-prohibitive to explore with our prior process.