Removing selection bias from evaluation of recommendations

Causal machine learning provides a powerful tool for estimating the effectiveness of Fulfillment by Amazon’s recommendations to selling partners.

October 21, 2024

More than 60% of sales in Amazon’s store come from independent sellers. One of the big drivers of this growth has been Fulfillment by Amazon, or FBA, which is an optional program to let sellers outsource order fulfillment to Amazon. FBA provides customers access to a vast selection of products at fast delivery speeds, and it lets sellers leverage Amazon’s global logistics network and advanced technology to pick, pack, and ship customer orders and handle customer service and returns. FBA also uses state-of-the-art optimization and machine learning models to provide sellers with inventory management recommendations, such as how much of which products to stock, how to promote products through sponsored ads, and whether and when to sell excess inventory at a discount.

The goal of these recommendations is to improve seller performance — for example, to maximize seller-relevant outcome metrics such as revenue, units shipped, and customer clicks on product listings. To determine whether the recommendations are working, we would like to compare the results sellers get from aligning with Amazon FBA recommendations with the results they would get from not aligning with them.

Selection bias

To measure and monitor the efficacy of such recommendations, we would, ideally, run experiments regularly. But we don’t run such experiments, because we want to preserve a positive seller experience and maintain fairness, and we do not want to negatively influence seller decisions. Let us explain.

An experiment involves two groups: the treatment group, which receives an intervention (such as a recommendation), and the control group, which doesn’t receive this intervention. A well-designed experiment would randomly assign some participants to the treatment group and others to the control group to ensure unbiased comparisons.

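To avoid subjecting sellers to such differential treatment, we instead rely on data that we collect by observing sellers’ decisions and the resulting outcomes. Observational data, however, is prone to selection bias: sellers who choose to align with a recommendation may differ systematically from those who do not, so a direct comparison of the two groups’ outcomes mixes the effect of the recommendation with the effect of those differences. Our methodology corrects for this bias and is, therefore, well suited for environments in which experimentation is potentially infeasible (e.g., healthcare, where experimentation would disrupt patient treatment and outcomes).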
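
To make the problem with observational data concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration: the single "experience" confounder, the follow probabilities, and the true effect of 2.0 are invented, not taken from FBA data. It shows how a naive comparison of sellers who followed a recommendation against those who did not can overstate the effect when the same seller trait drives both the decision and the outcome.

```python
import random
import statistics

random.seed(0)

TRUE_EFFECT = 2.0  # assumed causal lift from following a recommendation

followers, non_followers = [], []
for _ in range(20_000):
    # Hypothetical confounder: standardized seller experience.
    experience = random.gauss(0.0, 1.0)
    # Experienced sellers are more likely to follow the recommendation...
    p_follow = 0.8 if experience > 0 else 0.2
    follows = random.random() < p_follow
    # ...and they also earn more regardless of what they decide.
    outcome = 3.0 * experience + TRUE_EFFECT * follows + random.gauss(0.0, 1.0)
    (followers if follows else non_followers).append(outcome)

# Naive comparison of group means, ignoring who chose to follow.
naive_estimate = statistics.mean(followers) - statistics.mean(non_followers)
print(f"true effect: {TRUE_EFFECT:.2f}, naive estimate: {naive_estimate:.2f}")
```

In this toy setup the naive difference in means comes out well above the true effect, because it also picks up the outcome gap between experienced and inexperienced sellers.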

Double machine learning

Our solution is to use double machine learning (DML), which combines two models to estimate causal effects: one model estimates the expected seller outcome, given the decision to align or not align with the recommendation; the other estimates the propensity to align with the recommendation. Variation in those propensities is the source of the selection bias.

Each model receives hundreds of inputs, including inventory management and product data. For each seller, we compute the residual of the seller outcome model (the difference between the model’s prediction and the actual outcome) and the residual of the seller decision model (the difference between the model’s prediction and the seller’s actual decision to follow the recommendation). These residuals represent the unexplained variation in the seller outcome and the seller decision, that is, the variation not explained by observable data.
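
In symbols, the two residuals can be written as follows. The notation is an assumption introduced here for illustration and does not appear in the original text: $Y_i$ is the outcome for seller $i$, $T_i \in \{0, 1\}$ indicates whether the seller followed the recommendation, and $X_i$ collects the observable inputs. The sketch follows the common "partialling-out" form of DML, in which the outcome model conditions on the inputs only; the exact specification used in the tutorial may differ.

```latex
\tilde{Y}_i = Y_i - \hat{\ell}(X_i), \qquad \hat{\ell}(x) \approx \mathbb{E}[Y \mid X = x]
\tilde{T}_i = T_i - \hat{m}(X_i),   \qquad \hat{m}(x) \approx \Pr(T = 1 \mid X = x)
```

The sign convention (actual minus predicted, or the reverse) does not matter as long as it is the same for both residuals.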

By working with residuals, we “remove” any influence that our inputs (e.g., the experience level of the seller) may have on the treatment effect estimate. When we regress the residuals of the outcome model on the residuals of the decision model, we estimate the impact of the unexplained variation in treatment status on the unexplained variation in the outcome. The resulting estimate is thus the causal impact of the seller’s decision to follow recommendations on the outcome.
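
The residual-on-residual regression can be sketched on simulated data as follows. This is a minimal illustration under stated assumptions, not the production method: the data-generating process and the true effect of 2.0 are invented, the nuisance models are plain least-squares fits standing in for whatever learner is actually used, and the two-fold cross-fitting step is the standard DML device rather than something the text above spells out. The helper name `fit_predict` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 4000, 5
TRUE_EFFECT = 2.0  # assumed causal effect of following a recommendation

# Simulated observables X (stand-ins for inventory and product features).
X = rng.normal(size=(n, k))
# The decision T depends on X, which is what creates selection bias.
propensity = np.clip(0.5 + 0.15 * X[:, 0], 0.05, 0.95)
T = (rng.random(n) < propensity).astype(float)
# The outcome Y depends on X and on the decision.
Y = X @ np.array([3.0, 1.0, 0.0, -1.0, 0.5]) + TRUE_EFFECT * T + rng.normal(size=n)

def fit_predict(X_train, y_train, X_test):
    """Least-squares nuisance model (a stand-in for any ML regressor)."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return A_test @ coef

# Cross-fitting: residuals for each fold come from models trained on the other fold.
folds = np.array_split(rng.permutation(n), 2)
y_res = np.empty(n)
t_res = np.empty(n)
for i, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    y_res[test_idx] = Y[test_idx] - fit_predict(X[train_idx], Y[train_idx], X[test_idx])
    t_res[test_idx] = T[test_idx] - fit_predict(X[train_idx], T[train_idx], X[test_idx])

# Final stage: slope of outcome residuals on decision residuals (no intercept).
theta_hat = (t_res @ y_res) / (t_res @ t_res)
naive = Y[T == 1].mean() - Y[T == 0].mean()
print(f"true: {TRUE_EFFECT:.2f}  naive: {naive:.2f}  DML: {theta_hat:.2f}")
```

In this simulation the naive difference in means absorbs the confounding through the first feature, while the residual-on-residual slope lands close to the assumed true effect.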

In our tutorial, we show how to use this method to compute the average treatment effect (ATE), the average treatment effect on the treated (ATT), and the conditional average treatment effect (CATE). ATE is the overall effect of the treatment (following the FBA recommendation) on the entire population of FBA sellers. It answers the question “On average, how much does following the recommendation change the seller outcome compared to not following the recommendation?”

ATT focuses on sellers who actually followed the recommendation. It answers the question “For those who followed the recommendation, what was the average effect compared to not following the recommendation?”

CATE breaks it down even further, looking at specific subgroups based on characteristics such as product category or current inventory level. It answers the question “For a specific group of sellers and products, how does following the recommendation affect them compared to not following the recommendation?”
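
The three quantities can be stated compactly in potential-outcomes notation. The symbols are assumptions introduced here rather than notation from the original text: $Y(1)$ and $Y(0)$ denote a seller's outcome with and without following the recommendation, $T$ is the actual decision, and $X$ is the set of seller and product characteristics.

```latex
\text{ATE} = \mathbb{E}\big[\,Y(1) - Y(0)\,\big]
\text{ATT} = \mathbb{E}\big[\,Y(1) - Y(0) \mid T = 1\,\big]
\text{CATE}(x) = \mathbb{E}\big[\,Y(1) - Y(0) \mid X = x\,\big]
```

Read this way, ATT is the ATE restricted to sellers who followed the recommendation, and ATE is the average of CATE(x) over the distribution of characteristics in the full seller population.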

Our approach is agnostic as to the type of machine learning model used. But we observe that, given the scale and tabular nature of our data, gradient-boosted decision trees offer a good compromise between the high efficiency but lower accuracy of linear-regression models and the high accuracy but lower efficiency of deep-learning models. Readers who are interested in the details can attend the INFORMS tutorial — or read our paper in the forthcoming issue of the TutORials in Operations Research journal.

In closing, before we make recommendations to sellers to help improve their outcomes, we carry out rigorous scientific work to build the recommendation algorithms, monitor their outcomes, and revise and rebuild them to ensure that seller outcomes really do improve.

Acknowledgments:

Xiaoxi Zhao, Ethan Dee, and Vivian Yu for contributing to the tutorial; FBA scientists for contributing to the Seller Assistance Efficacy workstream; Michael Miksis for managing the related product and program; FBA product managers and engineers for pushing the outcome of this workstream into their respective products; Alexandre Belloni and Xinyang Shen for their constructive suggestions; and WW FBA Leadership for their support.

About the Authors

Özalp Özer

Özalp Özer is the director of worldwide Fulfillment by Amazon science.

Serdar Şimşek

Serdar Şimşek is an associate professor of operations management at the University of Texas at Dallas and an Amazon Visiting Academic working with Fulfillment by Amazon.
