# Explaining changes in real-world data

## New method identifies which causal factors contribute most to observed changes in probability distributions.

The success of deep learning is a testament to the power of statistical correlation: if certain image features are consistently correlated with the label “cat”, you can teach a machine learning model to identify cats.

But sometimes, correlation is not enough; you need to identify causation. For example, during the COVID-19 pandemic, a retailer might have seen a sharp decline in its inventory for a particular product. What caused that decline? An increase in demand? A shortage in supply? Delays in shipping? The failure of a forecasting model? The remedy might vary depending on the cause.

Earlier this month, at the International Conference on Artificial Intelligence and Statistics (AISTATS), my colleagues and I presented a new technique for identifying the causes of shifts in a probability distribution. Our approach involves causal graphs, which are graphical blueprints of sequential processes.

Each node of the graph, together with its incoming edges, represents a causal mechanism, or the probability that a given event will follow from the event that precedes it. We show how to compute the contribution that changes in the individual mechanisms make to changes in the probability of the final outcome.

We tested our approach using simulated data, so that we could stipulate the probabilities of the individual causal mechanisms, giving us a ground truth to measure against. Our approach yielded estimates that were very close to the ground truth — a deviation of only 0.29 according to L1 distance. And we achieved that performance even at small sample sizes — as few as 500 samples, drawn at random from the probability distributions we stipulated.

Consider a causal graph, which represents factors contributing to the amount of inventory that a retailer has on hand. (This is a drastic simplification; the causal graphs for real-world inventory counts might have dozens of factors, rather than five.)

Each input-output relation in this network has an associated conditional probability distribution, or causal mechanism. The probabilities associated with the individual causal mechanisms determine the joint distribution of all the variables (X1-X5), or the probability that any given combination of variables will occur together. That in turn determines the probability distribution of the target variable — the amount of inventory on hand.

A large change to the final outcome may be accompanied by changes to all the causal mechanisms in the graph. Our technique identifies the causal mechanism whose change is most responsible for the change in outcome.

Our fundamental insight is that any given causal mechanism in the graph could, in principle, change without affecting the others. So given a causal graph, the initial causal mechanisms, and data that imply new causal mechanisms, we update the causal mechanisms one by one to determine the influence each has on the outcome.

The problem with this approach is that our measurement of each node’s contribution depends on the order in which we update the nodes. The measurement evaluates the consequences of changing the node’s causal mechanism given every possible value of the other variables in the graph. But the probabilities of those values change when we update causal mechanisms. So we’ll get different measurements, depending on which causal mechanisms have been updated.

To address this problem, we run through every permutation of the update order and average the per-node results, an adaptation of a technique from game theory called computing the Shapley value.

In practice, of course, causal mechanisms are something we have to infer from data; we’re not given probability distributions in advance. But to test our approach, we created a simple causal graph in which we could stipulate the distributions. Then, using those distributions, we generated data samples.

Across 100 different random changes to the causal mechanisms of our graph, our method performed very well; with 500 data samples per change, it achieved an average deviation from ground truth of 0.29 as measured by L1 distance. Our ground truth is at least a 3-D vector (6-D at most), with at least one component whose magnitude is at least one (five at most). Therefore, a 0.29 L1 distance in the worst case is still a relatively small distance from 1.

We tested different volumes of data samples, from 500 to 4,000, but adding more samples had little effect on the accuracy of the approximation.

Internally, we have also applied our technique to questions of supply chain management. For a particular family of products, we were able to identify the reasons for a steady decline in on-hand inventory during the pandemic, when that figure had held steady for the preceding year.

Research areas

## Related content

• December 07, 2023
Using gradient diversity to optimize selection of past samples for retention improves performance while combatting catastrophic forgetting.
• Staff writer
December 06, 2023
Research on natural-language understanding seeks to harness the power of large language models, while query reformulation and text summarization emerge as topics of particular interest.
• Staff writer
December 04, 2023
UT Austin-Amazon Science Hub seeks to advance research in artificial intelligence, machine learning, and large language models.

## Work with us

US, WA, Seattle
DE, BE, Berlin
US, WA, Seattle
US, WA, Seattle
Are you interested in working with top talent in Optimization, Operations Research and Supply Chain to help Amazon to efficiently match our Devices with worldwide customers? We have challenging problems and need your innovative solutions to make tremendous financial impacts! The Amazon Devices Science team is looking for a Research Scientist with background in Operations Research, Optimization, Supply Chain and/or Simulation to support science efforts to integrate across inventory management functionalities. Our team is responsible for science models (both deterministic and stochastic) that power world-wide inventory allocation for Amazon Devices business that includes Echo, Kindle, Fire Tablets, Amazon TVs, Amazon Fire TV sticks, Ring, and other smart home devices. We formulate and solve challenging large-scale financially-based optimization problems which ingest demand forecasts and produce optimal procurement, production, distribution, and inventory management plans. In addition, we also work closely with demand forecasting, material procurement, production planning, finance, and logistics teams to co-optimize the inventory management and supply chain for Amazon Devices given operational constraints. Key job responsibilities The successful candidate will be a self-starter, comfortable with ambiguity, with strong attention to detail, and ability to work in a fast-paced and ever-changing environment and a desire to help shape the overall business. Job responsibilities include: - Design and develop advanced mathematical, simulation, and optimization models and apply them to define strategic and tactical needs and drive appropriate business and technical solutions in the areas of inventory management and distribution, network flow, supply chain optimization, and demand planning - Apply mathematical optimization techniques (linear, quadratic, SOCP, robust, stochastic, dynamic, mixed-integer programming, network flows, nonlinear, nonconvex programming) and algorithms to design optimal or near optimal solution methodologies to be used by in-house decision support tools and software - Research, prototype and experiment with these models by using modeling languages such as Python; participate in the production level deployment - Create, enhance, and maintain technical documentation, and present to other Scientists, Product, and Engineering teams - Support project plans from a scientific perspective by managing product features, technical risks, milestones and launch plans - Influence organization's long-term roadmap and resourcing, and onboard new technologies onto the Science team's toolbox We are open to hiring candidates to work out of one of the following locations: Seattle, WA, USA
US, VA, Arlington