Building product graphs automatically

Automated system tripled the number of facts in a product graph.

Knowledge graphs are data structures that capture relationships between data in a very flexible manner. They can help make information retrieval more precise, and they can also be used to uncover previously unknown relationships in large data sets.

Manually assembling knowledge graphs is extremely time consuming, so researchers in the field have long been investigating techniques for producing them automatically. The approach has been successful for domains such as movie information, which feature relatively few types of relationships and abound in sources of structured data.

Automatically producing knowledge graphs is much more difficult in the case of retail products, where the types of relationships between data items are essentially unbounded — color for clothes, flavor for candy, wattage for electronics, and so on — and where much useful information is stored in free-form product descriptions, customer reviews, and question-and-answer forums.

Figure: The inputs to AutoKnow include an existing product taxonomy, user logs, and a product catalogue. AutoKnow automatically combines data from all three sources into a product graph, adding new product types to the taxonomy, adding new values for product attributes, correcting errors, and identifying synonyms.
Credit: Stacy Reilly

This year, at the Association for Computing Machinery’s annual conference on Knowledge Discovery and Data Mining (KDD), my colleagues and I will present a system we call AutoKnow, a suite of techniques for automatically augmenting product knowledge graphs with both structured data and data extracted from free-form text sources.

With AutoKnow, we increased the number of facts in Amazon’s consumables product graph (which includes the categories grocery, beauty, baby, and health) by almost 200%, identifying product types with 87.7% accuracy.

We also compared each of our system’s five modules, which execute tasks such as product type extraction and anomaly detection, to existing systems and found that they improved performance across the board, often quite dramatically (an improvement of more than 300% in the case of product type extraction).

The AutoKnow framework

Knowledge graphs typically consist of entities — the nodes of the graph, often depicted as circles — and relations between the entities — usually depicted as line segments connecting nodes. The entity “drink”, for example, might be related to the entity “coffee” by the relationship “contains”. The entity “bag of coffee” might be related to the entity “16 ounces” by the relationship “has_volume”.
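As a rough illustration of this structure, a knowledge graph can be stored as a list of (subject, relation, object) triples. The sketch below uses the two examples from the paragraph above; the helper names build_index and objects_of are hypothetical and are not part of AutoKnow.

```python
from collections import defaultdict

# Each fact is a (subject, relation, object) triple, using the examples from the text.
triples = [
    ("drink", "contains", "coffee"),
    ("bag of coffee", "has_volume", "16 ounces"),
]

def build_index(facts):
    """Index facts by subject so outgoing relations can be looked up directly."""
    index = defaultdict(list)
    for subject, relation, obj in facts:
        index[subject].append((relation, obj))
    return index

def objects_of(index, subject, relation):
    """Return every object linked to `subject` by `relation`."""
    return [obj for rel, obj in index.get(subject, []) if rel == relation]

graph = build_index(triples)
print(objects_of(graph, "bag of coffee", "has_volume"))  # ['16 ounces']
```

Storing facts as triples keeps the schema open: a new relation such as "has_flavor" needs no change to the data structure, which matters when the set of attributes is effectively unbounded.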

In a narrow domain such as movie information, the number of entity types (director, actor, editor, and so on) is limited, as is the number of relationships (directed, performed in, edited, and so on). Moreover, movie sources often provide structured data, explicitly listing cast and crew.

In a retail domain, on the other hand, the number of product types tends to grow as the graph expands. Each product type has its own set of attributes, which may be entirely different from the next product type’s — color and texture, for instance, versus battery type and effective range. And the vital information about a product — that a coffee mug gets too hot to hold, for instance — could be buried in the free-form text of a review or question-and-answer section.

AutoKnow addresses these challenges with five machine-learning-based processing modules, each of which builds on the outputs of the one that precedes it:

  1. Taxonomy enrichment extends the number of entity types in the graph;
  2. Relation discovery identifies attributes of products, those attributes’ range of possible values (different flavors or colors, for instance), and, crucially, which of those attributes are important to customers;
  3. Data imputation uses the entity types and relations discovered by the previous modules to determine whether free-form text associated with products contains any information missing from the graph;
  4. Data cleaning sorts through existing and newly extracted data to see whether any of it was misclassified in the source texts; and
  5. Synonym finding attempts to identify entity types and attribute values that have the same meaning.
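The chaining of the five modules can be sketched as a simple pipeline in which each stage receives the state produced by the previous one. This is only a structural illustration: the stage bodies are placeholders that record that they ran, and the names make_stage, run_pipeline, and the dictionary-based state are assumptions, not AutoKnow's interfaces.

```python
from functools import reduce

# Stage names follow the five modules listed above.
STAGE_NAMES = [
    "taxonomy_enrichment",
    "relation_discovery",
    "data_imputation",
    "data_cleaning",
    "synonym_finding",
]

def make_stage(name):
    """Build a placeholder stage that only records that it ran."""
    def stage(state):
        # A real stage would add or correct facts here, reading what
        # earlier stages wrote into the state.
        return {**state, "completed": state["completed"] + [name]}
    return stage

PIPELINE = [make_stage(name) for name in STAGE_NAMES]

def run_pipeline(initial_state, stages=PIPELINE):
    """Feed each stage the output of the one before it."""
    return reduce(lambda state, stage: stage(state), stages, initial_state)

result = run_pipeline({"facts": [], "completed": []})
print(result["completed"])
```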

The ontology suite

The inputs to AutoKnow include an existing product graph; a catalogue of products that includes some structured information, such as labeled product names, and unstructured product descriptions; free-form product-related information, such as customer reviews and sets of product-related questions and answers; and product query data.

To identify new product types, the taxonomy enrichment module uses a machine learning model that labels substrings of the product titles in the source catalogue. For instance, in the product title “Ben & Jerry’s black cherry cheesecake ice cream”, the model would label the substring “ice cream” as the product type.

The same model also labels substrings that indicate product attributes, for use during the relation discovery step. In this case, for instance, it would label “black cherry cheesecake” as the flavor attribute. The model is trained on product descriptions whose product types and attributes have already been classified according to a hand-engineered taxonomy.
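To make the labeling concrete, the sketch below shows how per-token tags could be turned back into labeled substrings. It assumes the common BIO tagging convention (B- begins a span, I- continues it, O is outside any span); the article does not say which scheme the model uses, and the tags here are written by hand rather than predicted.

```python
def extract_spans(tokens, tags):
    """Group BIO-tagged tokens into (label, text) spans."""
    spans = []
    current_label, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_label
        )
        if starts_new:
            # Close any open span, then open a new one for this label.
            if current_tokens:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-"):
            current_tokens.append(token)
        else:
            # An "O" tag ends whatever span was open.
            if current_tokens:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []
    if current_tokens:
        spans.append((current_label, " ".join(current_tokens)))
    return spans

tokens = ["Ben", "&", "Jerry's", "black", "cherry", "cheesecake", "ice", "cream"]
# Hand-written tags standing in for the model's output.
tags = ["O", "O", "O", "B-flavor", "I-flavor", "I-flavor", "B-type", "I-type"]
print(extract_spans(tokens, tags))
# [('flavor', 'black cherry cheesecake'), ('type', 'ice cream')]
```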

Next, the taxonomy enrichment module classifies the newly extracted product types according to their hypernyms, or the broader product categories that they fall under. Ice cream, for instance, falls under the hypernym “Ice cream and novelties”, which falls under the hypernym “Frozen”, and so on.
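The hypernym chain in this example can be represented as child-to-parent links and walked upward. The sketch below is an illustration under that assumption; it stores a single parent per type, so it does not cover product types with several hypernyms.

```python
# Child -> parent (hypernym) links, taken from the example in the text.
HYPERNYM = {
    "Ice cream": "Ice cream and novelties",
    "Ice cream and novelties": "Frozen",
}

def hypernym_path(product_type, parents=HYPERNYM):
    """Walk from a product type up to the broadest category that is recorded."""
    path = [product_type]
    seen = {product_type}
    while path[-1] in parents:
        parent = parents[path[-1]]
        if parent in seen:
            # Guard against a malformed taxonomy that loops back on itself.
            raise ValueError(f"cycle in taxonomy at {parent!r}")
        path.append(parent)
        seen.add(parent)
    return path

print(hypernym_path("Ice cream"))
# ['Ice cream', 'Ice cream and novelties', 'Frozen']
```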

The hypernym classifier uses data about customer interactions, such as which products customers viewed or purchased after a single query. Again, the machine learning model is trained on product data labeled according to an existing taxonomy.

Relation discovery

The relation discovery module classifies product attributes according to two criteria. The first is whether the attribute applies to a given product. The attribute flavor, for instance, applies to food but not to clothes.

The second criterion is how important the attribute is to buyers of a particular product. Brand name, it turns out, is more important to buyers of snack foods than to buyers of produce.

Both classifiers analyze data supplied by providers (product descriptions) and by customers (reviews and Q&As). With both types of input data, the classifiers consider the frequency with which attribute words occur in texts associated with a given product; with the provider data, they also consider how frequently a given word occurs across instances of a particular product type.

The models were trained on data that had been annotated to indicate whether particular attributes applied to the associated products.
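The two kinds of frequency signal described above can be approximated with simple counts. The sketch below is an assumption about what such features might look like, not the feature set the classifiers actually use: term_frequency measures how often an attribute word appears in a set of texts, and coverage measures the share of product descriptions of one type that mention the word at all. The sample texts are invented.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def term_frequency(texts, word):
    """Occurrences of `word` divided by the total number of tokens in `texts`."""
    tokens = [t for text in texts for t in tokenize(text)]
    return tokens.count(word) / len(tokens) if tokens else 0.0

def coverage(descriptions, word):
    """Share of product descriptions that mention `word` at least once."""
    if not descriptions:
        return 0.0
    return sum(word in tokenize(d) for d in descriptions) / len(descriptions)

# Invented provider descriptions for products of one type.
ice_cream_descriptions = [
    "Black cherry flavor ice cream, one pint",
    "Vanilla flavor ice cream with chocolate chips",
    "Dairy-free frozen dessert",
]
# Invented customer reviews for one product.
ice_cream_reviews = ["Great flavor, but the flavor fades fast", "Melted on arrival"]

print(coverage(ice_cream_descriptions, "flavor"))
print(term_frequency(ice_cream_reviews, "flavor"))
```

A word like "flavor" scoring high on both measures for ice cream, and near zero for clothing, is the kind of contrast an applicability classifier could learn from.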

The data suite

Step three, data imputation, looks for terms in product descriptions that may fit the new product and attribute categories identified in the previous steps, but which have not yet been added to the graph.

This step uses embeddings, which represent descriptive terms as points in a vector space, where related terms are grouped together. The idea is that, if a number of terms clustered together in the space share the same attribute or product type, the unlabeled terms in the same cluster should, too.
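The intuition that nearby terms share a label can be illustrated with a nearest-neighbor vote over embeddings. This is a simplified stand-in for the idea, not the imputation model itself: the two-dimensional vectors are toy values made up for the example, and impute_label and the choice of k are hypothetical.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def impute_label(vector, labeled, k=3):
    """Majority label among the k labeled terms most similar to `vector`."""
    nearest = sorted(
        labeled, key=lambda item: cosine(vector, item[0]), reverse=True
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Toy 2-D embeddings; real embeddings are learned and much higher-dimensional.
labeled = [
    ((0.9, 0.1), "flavor"),  # e.g. "chocolate"
    ((0.8, 0.2), "flavor"),  # e.g. "vanilla"
    ((0.1, 0.9), "color"),   # e.g. "blue"
    ((0.2, 0.8), "color"),   # e.g. "red"
]

# An unlabeled term that sits close to the flavor terms.
print(impute_label((0.85, 0.15), labeled))  # flavor
```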

Previously, my Amazon colleagues and I, together with colleagues at the University of Utah, demonstrated state-of-the-art data imputation results by training a sequence-tagging model, much like the one I described above, which labeled “black cherry cheesecake” as a flavor.

Here, however, we vary that approach by conditioning the sequence-tagging model on the product type: that is, the tagged sequence output by the model depends on the product type, whose embedding we include among the inputs.

Figure: The architecture of the AutoKnow cleaning module.

The next step is data cleaning, which uses a machine learning model based on the Transformer architecture. The inputs to the model are a textual product description, an attribute (flavor, volume, color, etc.), and a value for that attribute (chocolate, 16 ounces, blue, etc.). Based on the product description, the model decides whether the attribute value is misassigned.

To train the model, we collect valid attribute-value pairs that occur across many instances of a single product type (all ice cream types, for instance, have flavors); these constitute the positive examples. We also generate negative examples by replacing the values in valid attribute-value pairs with mismatched values.
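The construction of negative examples can be sketched as a value swap. In this illustration a "mismatched" value is taken to be one observed for a different attribute, which is an assumption made for the example; the article does not specify how mismatched values are chosen. The sample triples and the function name make_negatives are likewise hypothetical.

```python
import random

def make_negatives(positives, rng):
    """Create one negative per positive by swapping in a value from another attribute."""
    values_by_attr = {}
    for _, attr, value in positives:
        values_by_attr.setdefault(attr, set()).add(value)

    negatives = []
    for description, attr, value in positives:
        # Candidate replacements: values seen with any other attribute.
        candidates = sorted(
            v
            for other_attr, values in values_by_attr.items()
            if other_attr != attr
            for v in values
            if v != value
        )
        if candidates:
            negatives.append((description, attr, rng.choice(candidates)))
    return negatives

# Invented (description, attribute, value) triples treated as valid.
positives = [
    ("Chocolate ice cream, 16 ounces", "flavor", "chocolate"),
    ("Chocolate ice cream, 16 ounces", "volume", "16 ounces"),
    ("Blue cotton T-shirt", "color", "blue"),
]

negatives = make_negatives(positives, random.Random(0))
# Label valid pairs 1 and mismatched pairs 0 to form a training set.
training_set = [(ex, 1) for ex in positives] + [(ex, 0) for ex in negatives]
print(len(training_set))  # 6
```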

Finally, we analyze our product and attribute sets to find synonyms that should be combined in a single node of the product graph. First, we use customer interaction data to identify items that were viewed during the same queries; their product and attribute descriptions are candidate synonyms.

Then we use a combination of techniques to filter the candidate terms. These include edit distance (a measure of the similarity of two strings of characters) and a neural network. In tests, this approach yielded a respectable 0.83 area under the precision-recall curve.
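For readers unfamiliar with edit distance, the sketch below computes the standard Levenshtein distance (the minimum number of single-character insertions, deletions, and substitutions) and uses it as a crude filter. The similar helper and its 0.25 threshold are assumptions for illustration; the article does not say which variant or cutoff is used, and the neural-network filter is not shown.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings `a` and `b`."""
    # previous[j] holds the distance between the processed prefix of `a` and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]

def similar(a, b, max_ratio=0.25):
    """Keep a candidate pair if the distance is small relative to the longer string."""
    longest = max(len(a), len(b))
    return longest == 0 or edit_distance(a.lower(), b.lower()) / longest <= max_ratio

print(edit_distance("flavor", "flavour"))  # 1
print(similar("flavor", "flavour"))        # True
```

Edit distance alone catches spelling variants but not synonyms with unrelated spellings, which is one reason to pair it with a learned model.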

In ongoing work, we’re addressing a number of outstanding questions, such as how to handle products with multiple hypernyms (products that have multiple “parents” in the product hierarchy), how to clean data before it’s used to train our models, and how to use image data as well as textual data to improve our models’ performance.

Watch a video presentation of the AutoKnow paper from Jun Ma, senior applied scientist.

About the Author
Xin Luna Dong
Xin Luna Dong is a principal scientist in the Amazon Product Graph group.
