Model assesses the validity of tips offered in product reviews
Method would enable customers to evaluate supporting evidence for tip reliability.
Product reviews are a popular and important feature of e-commerce websites, which many customers rely on in their shopping journeys. The reviews often contain personal experiences and opinions that can help other customers make more informed purchasing decisions. Reviews also often contain practical and non-obvious advice for making better, easier, and safer use of products. For example, “Charge for 8 hours before using this camera for the first time.” Such recommendations are referred to as “product tips”.
To save customers from having to read through tens or even hundreds of reviews to locate helpful tips, researchers have introduced automatic methods for extracting tips from reviews. These tips can be presented, for example, in dedicated widgets on the sites. However, as tips are typically non-obvious recommendations, customers may rightfully question their validity and look for support or opposition from fellow customers.
In a paper that we presented at this year’s meeting of the ACM Special Interest Group on Information Retrieval (SIGIR), cowritten with Miriam Farber (who was at Amazon when the work was done) and David Carmel, we present a method for determining the degree to which a tip is supported or opposed by all of a product’s reviews.
At the heart of our method is a model that determines the level of support, contradiction, or neutrality between a tip and a sentence from another review. This is a challenging task, as support and contradiction between two natural-language sentences come in many forms. For example, the recommendation “Charge for 8 hours before using this camera for the first time” is supported by the sentence “it’s recommended to charge before usage” but contradicted by the statement “The battery comes pre-charged”.
In an experiment using product tips from multiple product categories, we retrieved for each tip up to five review sentences that our model identified as supporting the tip and up to five sentences identified as contradicting it. At coverage of 50% — that is, when we restrict ourselves to the 50% of tip-sentence pairs for which our model makes its most confident predictions — our method achieves precisions of 72% and 58% in detecting support and contradiction relations, respectively.
As our task is precision oriented, we also consider coverage of 25% and find that precision improves to 79% and 67% for support and contradiction relations, respectively. These results reflect 8% and 29% relative improvements over off-the-shelf models, attesting to the challenging nature of this task. We further found that at least half of the extracted tips have supporting reviews, and at least a third have contradicting reviews.
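The coverage-based evaluation above can be sketched as follows. This is an illustrative reconstruction, not the paper’s evaluation code: predictions are ranked by model confidence, only the most confident fraction is kept, and precision is measured on that subset.

```python
# Sketch of precision at a given coverage level (illustrative, not the
# paper's implementation).
def precision_at_coverage(predictions, coverage=0.5):
    """predictions: list of (confidence, is_correct) pairs."""
    # Rank all tip-sentence predictions by model confidence.
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    # Keep only the most confident fraction of predictions.
    kept = ranked[: max(1, int(len(ranked) * coverage))]
    # Precision on the retained subset.
    return sum(1 for _, correct in kept if correct) / len(kept)
```

Restricting to lower coverage trades recall for precision, which suits a precision-oriented setting like this one.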
Our new method can potentially be integrated into widgets that offer tips and also provide their support levels and links to related reviews, so customers can assess their validity.
Tips’ support-level estimation
Our method operates in three steps, as shown in the following example:
Step 1: Given a product tip that was extracted from a customer review, our goal is to measure the amount of support and contradiction the tip receives from all reviews of that product. However, some products have thousands of reviews, so our algorithm retrieves the few hundred sentences with the greatest similarity to the tip. We estimate similarity using nearest-neighbor search over sentence embeddings. This is done in order to expedite the next steps, which rely on more computation-intensive models.
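The retrieval step can be sketched as below. The `embed` function here is a toy stand-in for a real sentence encoder (the actual system would use learned sentence embeddings and an approximate-nearest-neighbor index); the sketch only illustrates ranking review sentences by cosine similarity to the tip.

```python
# Illustrative sketch of similarity-based retrieval; embed() is a toy
# stand-in for a trained sentence encoder.
import numpy as np

def embed(sentence: str, dim: int = 64) -> np.ndarray:
    # Toy deterministic "embedding": hash each token into a bucket.
    vec = np.zeros(dim)
    for token in sentence.lower().split():
        vec[sum(ord(c) for c in token) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def retrieve_similar(tip: str, review_sentences: list, k: int = 200) -> list:
    # Rank all review sentences by cosine similarity to the tip and keep
    # the top k (a few hundred in the setting described above).
    tip_vec = embed(tip)
    scored = [(float(embed(s) @ tip_vec), s) for s in review_sentences]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in scored[:k]]
```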
Step 2: Using a sentence-to-sentence support-level classifier, we compute a support score and a contradiction score for the tip and each of the related sentences. The support-level classifier is a neural model that was trained on pairs of sentences that were manually annotated as supportive, contradictory, or neutral relative to each other. The classifier outputs three scores — for support, contradiction, and neutrality — that sum to 1.
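The classifier’s output contract can be illustrated as follows. The real model is a trained neural network; `score_logits` here is a hypothetical stand-in for its raw outputs, and the sketch only shows how a softmax turns them into three scores that sum to 1.

```python
# Sketch of the support-level classifier's output contract; score_logits
# is a hypothetical placeholder for a trained model's raw scores.
import math

LABELS = ("support", "contradiction", "neutral")

def softmax(logits):
    # Numerically stable softmax over the three label logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(tip, sentence, score_logits):
    # Returns {"support": p1, "contradiction": p2, "neutral": p3},
    # with p1 + p2 + p3 = 1.
    probs = softmax(score_logits(tip, sentence))
    return dict(zip(LABELS, probs))
```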
Step 3: Finally, all the support scores and contradiction scores are aggregated over all related sentences, providing a global support score and a global contradiction score, which reflect the support level of all reviews relative to the given tip.
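One plausible aggregation scheme is sketched below (the paper’s actual aggregation may differ): count the related sentences whose per-label score crosses a threshold, and normalize by the number of related sentences to obtain global support and contradiction scores.

```python
# Sketch of score aggregation over related sentences (illustrative; the
# threshold and normalization are assumptions, not the paper's method).
def aggregate(pair_scores, threshold=0.5):
    """pair_scores: list of (support_score, contradiction_score) per sentence."""
    n = len(pair_scores)
    if n == 0:
        return 0.0, 0.0
    # Fraction of related sentences that confidently support the tip...
    support = sum(1 for s, _ in pair_scores if s >= threshold) / n
    # ...and the fraction that confidently contradict it.
    contradiction = sum(1 for _, c in pair_scores if c >= threshold) / n
    return support, contradiction
```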
With the ability to estimate a tip’s support and contradiction scores, we define the following taxonomy to characterize a tip:
- Highly supported: Tip with many supporting and almost no contradicting sentences.
- Highly contradicted: Tip with many contradicting and almost no supporting sentences.
- Controversial: Tip with many supporting and many contradicting sentences.
- Anecdotal: Tip with almost no supporting and no contradicting sentences.
To examine the distribution of tips according to this taxonomy, we split the support and contradiction scores into three ranges: low, medium, and high. The tips are then assigned to the cells they belong to, creating three-by-three heat maps.
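The binning described above can be sketched as follows; the bin edges are illustrative assumptions, not the thresholds used in the paper.

```python
# Sketch: bin global support/contradiction scores into low/medium/high
# and tally tips into a 3x3 grid (bin edges here are assumed values).
def bin_score(score, edges=(0.33, 0.66)):
    # 0 = low, 1 = medium, 2 = high.
    return 0 if score < edges[0] else (1 if score < edges[1] else 2)

def heatmap(tip_scores):
    """tip_scores: list of (global_support, global_contradiction) per tip."""
    grid = [[0] * 3 for _ in range(3)]
    for support, contradiction in tip_scores:
        grid[bin_score(support)][bin_score(contradiction)] += 1
    return grid
```

In this grid, the high-support/low-contradiction cell holds the highly supported tips, the opposite corner the highly contradicted ones, and the high/high region the controversial ones.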
As examples, the figure below presents the heat maps (a) across all categories and (b) for the apparel category. We found that controversial tips are very common in the apparel category (43% of tips). These tips are often size related, e.g., "Order a size bigger than what you would normally wear", while other reviews suggest, "This is true to size and fits perfectly."
Product reviews, and product tips in particular, are important and helpful to customers. We believe that by presenting the support level per tip and providing links to supporting or opposing reviews, we can help customers estimate tips’ validity and decide how much credence to give each tip.