Deep learning has been highly successful for natural-language-processing tasks, like interpreting commands on Alexa. But it’s been less successful for information retrieval tasks, like product discovery on Amazon, due to a lack of negative training examples.
It’s relatively easy to teach a deep-learning model that, say, the query “Fire HD 10” matches the product named Fire HD 10. But teaching it that “Fire HD 10 case” and “Fire HD 10 charger” do not match the same product is more subtle and requires a lot of negative training examples. However, identifying and annotating negative examples for every item in a large product catalogue would be a massive undertaking.
This year, at the annual meeting of the Association for Computational Linguistics, my colleagues and I will present a new way to train deep-learning-based product discovery systems by automatically generating negative training examples. In our experiments, this approach improved performance by 16% over the state-of-the-art approach and by 62% over the approach typically used in commercial product discovery systems.
Our technique uses adversarial learning, which in recent years has proved remarkably successful in systems for generating visual images. In a typical adversarial-learning system, a generator that synthesizes data and a discriminator that tries to distinguish real data from synthetic are trained together.
In our case, however, the system is not trying to recognize synthetic examples; it’s simply trying to classify the synthetic examples accurately. This difference allowed us to design a new neural architecture that makes adversarial learning simpler and more efficient.
Match and mismatch
The inputs to our system are a customer query and a product name. The output is a decision about whether the product is a good match for the query.
For such a system, positive training examples are relatively easy to create automatically. Many shoppers who enter the search query “Fire HD 10” may in fact click on links for cases and chargers, but odds are that more will click on the link for the device itself. By aggregating statistics across many customer queries, an automated system can reliably produce examples of well-matched queries and products.
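As a rough illustration of the aggregation step, here is a minimal sketch in Python. The click-log format, the function name `positive_examples`, and the thresholds `min_clicks` and `min_share` are all assumptions made for the example; they are not details of our production pipeline.

```python
from collections import Counter, defaultdict

def positive_examples(click_log, min_clicks=2, min_share=0.5):
    """Return (query, product) pairs where one product dominates a query's clicks.

    click_log is a hypothetical list of (query, clicked_product) pairs.
    min_clicks and min_share are illustrative thresholds, not values from the article.
    """
    clicks_per_query = defaultdict(Counter)
    for query, product in click_log:
        clicks_per_query[query][product] += 1

    positives = []
    for query, counts in clicks_per_query.items():
        total = sum(counts.values())
        product, top = counts.most_common(1)[0]
        # Keep the pair only if the top product has enough clicks
        # and a large enough share of all clicks for this query.
        if top >= min_clicks and top / total >= min_share:
            positives.append((query, product))
    return positives

log = [
    ("fire hd 10", "Fire HD 10 tablet"),
    ("fire hd 10", "Fire HD 10 tablet"),
    ("fire hd 10", "Fire HD 10 tablet"),
    ("fire hd 10", "Fire HD 10 case"),
    ("fire hd 10", "Fire HD 10 charger"),
    ("running shoes", "Trail hiking shoes"),
]
print(positive_examples(log))
```

In this toy log, the tablet receives three of the five clicks for “fire hd 10”, so that pair is kept, while the single click for “running shoes” is too little evidence to count as a match.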
Negative examples are harder. Customers sometimes intentionally click mismatched items (such as cases and chargers after the query “Fire HD 10”), while conversely, a lack of clicks is no guarantee that queries and items are mismatched.
Existing natural-language-processing systems could determine that, for instance, “telescope” is a bad match for the query “running shoes”, but a product discovery system wouldn’t learn much from such egregiously mismatched examples. Our goal was to develop a generator that would automatically produce more challenging negative examples, such as the product “hiking shoes” mismatched with the query “running shoes”.
Hence our use of adversarial learning. During training, we feed our network automatically labeled positive examples. On a random basis, it chooses some of them for conversion to negative examples. The generator overwrites half of the example — the query — and changes its label from “match” to “mismatch”.
In a typical application of adversarial learning, the generator and discriminator are simultaneously trained on competing objectives, which complicates the machine learning process. In our case, however, we simply switch back and forth between objectives during training.
Moreover, the switch mechanism is a set of simple arithmetic operations performed within the network, which means that we can train the network using the standard machine learning algorithm (back-propagation).
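To give a feel for how a switch can be expressed as arithmetic, here is a minimal sketch. It is one plausible formulation, not the exact operations in our network: the names `apply_switch` and `sample_switch`, the blending formula, and the probability `p_negative` are all assumptions made for illustration.

```python
import random

def apply_switch(real_query, generated_query, label, switch):
    """Select between two query vectors using only multiplication and addition.

    switch = 0.0 keeps the real query and its original label.
    switch = 1.0 substitutes the generated query and sets the label to 0.0 (mismatch).
    This blending formula is an illustrative assumption, not the article's exact design.
    """
    query = [switch * g + (1.0 - switch) * r
             for r, g in zip(real_query, generated_query)]
    new_label = (1.0 - switch) * label
    return query, new_label

def sample_switch(rng, p_negative=0.5):
    """Randomly decide whether an example is converted (hypothetical probability)."""
    return 1.0 if rng.random() < p_negative else 0.0

rng = random.Random(0)
real = [1.0, 2.0]        # toy encoding of the real query
generated = [5.0, 6.0]   # toy encoding of the generator's output
print(apply_switch(real, generated, label=1.0, switch=sample_switch(rng)))
```

Because the selection in this sketch uses only multiplication and addition, gradients can flow through it, which is the property that makes training with ordinary back-propagation possible.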
Another key to our network is the attention layer that immediately precedes the classification layer. The attention layer learns to concentrate the classifier’s attention on elements of both the query and the product name that are particularly important for assessing matches.
For instance, if the query is “Fire HD 10 case”, and the matched product is a Fire HD 10 case, the attention layer will give greater weight to the word “case” than to the word “Fire”, since “case” better distinguishes the query from other queries relating to Fire tablets.
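The weighting idea can be sketched with a generic softmax attention, shown below. The scores and token vectors are made-up numbers chosen for the example; in a trained model they would be learned, and our actual attention layer may differ in its details.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(token_vectors, scores):
    """Return attention weights and the weighted sum of the token vectors."""
    weights = softmax(scores)
    dim = len(token_vectors[0])
    pooled = [sum(w * vec[i] for w, vec in zip(weights, token_vectors))
              for i in range(dim)]
    return weights, pooled

tokens = ["fire", "hd", "10", "case"]
scores = [0.2, 0.1, 0.3, 2.0]  # hypothetical scores; "case" is scored highest
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]  # toy embeddings

weights, pooled = attend(vectors, scores)
for token, weight in zip(tokens, weights):
    print(f"{token}: {weight:.2f}")
```

With these toy scores, “case” receives the largest weight, so it contributes most to the pooled vector passed on to the classifier.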
In experiments, we compared our model to four others, including gradient-boosted decision trees, which are commonly used in the product discovery space, and MatchPyramid, a model proposed four years ago that has been shown to significantly outperform other models on matching tasks.
We measured performance using two metrics, each of which accounts for both false positives and false negatives: F1 score and area under the precision-recall curve (APR). MatchPyramid was the best-performing of the baselines, but our model surpassed it by 16% on F1 score and 8% on APR. Compared to gradient-boosted decision trees, our model’s improvements were 62% on F1 score and 57% on APR.
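For readers unfamiliar with these metrics, the sketch below computes an F1 score and an average precision on a handful of made-up labels. Average precision is one common way to summarize a precision-recall curve and is used here as an assumption; it is not necessarily how APR was computed in our experiments, and the data are toy values, not experimental results.

```python
def f1_score(labels, predictions):
    """F1 = 2*TP / (2*TP + FP + FN), for binary labels and predictions (1 = match)."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def average_precision(labels, scores):
    """Mean of the precision values at the rank of each true match."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

labels = [1, 0, 1, 1, 0]                 # toy ground truth
predictions = [1, 1, 1, 0, 0]            # toy hard decisions
scores = [0.9, 0.8, 0.7, 0.4, 0.2]       # toy match scores

f1 = f1_score(labels, predictions)
ap = average_precision(labels, scores)
print(f"F1 = {f1:.3f}, average precision = {ap:.3f}")
```

Both metrics fall when the model admits mismatched products (false positives) and when it misses true matches (false negatives), which is why they suit a task where both kinds of error matter.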