Statistical power calculations revisited: Incorporating beliefs about effect sizes
2025
In A/B testing, statistical power depends on both the variance of estimated impacts and the distribution of true impacts. A low-variance metric can have low power if true impacts on the metric tend to be small, while a high-variance metric can have high power if true impacts on the metric tend to be large. Traditional power calculations, however, focus solely on the variance of estimated impacts. They compute either the probability of detecting a fixed effect size or the smallest effect size that can be detected with high probability (i.e., the “minimum detectable effect” or MDE). While such calculations capture the role of the variance of estimated impacts, they do not provide a way to measure expected power that takes into account uncertainty or beliefs about the distribution of true impacts. In this paper, we present two approaches to connecting power calculations to beliefs about the distribution of true impacts. First, we show how frequentists can compute “prior-informed average power” by taking a weighted average of conventional power across different effect sizes, with weights based on how likely each effect size is believed to be. Second, we show how Bayesians can compute “Bayesian decision power” by taking a weighted average of the probability of meeting a launch or dial-down criterion across different effect sizes, with weights again based on how likely each effect size is believed to be. When true impacts are assumed to be normally distributed, both approaches yield simple closed-form expressions that can be computed using data readily available in most A/B testing tools. These approaches enable A/B testing tools to provide more realistic and informative assessments of statistical power. By incorporating beliefs about the distribution of true impacts, they can better inform experiment design decisions such as traffic allocation and duration by leveraging the relative power of different metrics.
This is especially valuable given that many large A/B testing tools already estimate beliefs regarding the distribution of true impacts via empirical Bayes methods but rarely leverage them in thinking about power. We provide a simple way to close the gap, aligning power calculations with the same beliefs regarding true impacts used in Bayesian inference.
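To make the two quantities concrete, the following is a minimal sketch and not necessarily the exact expressions derived in the paper. It assumes a two-sided z-test, an impact estimate that is normally distributed around the true impact with standard error `se`, and a normal prior on true impacts with mean `prior_mean` and standard deviation `prior_sd`. The function names, the parameter names, and the launch rule used here (launch when the posterior probability of a positive impact reaches `threshold`) are all illustrative assumptions. Under these assumptions the estimate is marginally normal with variance `se**2 + prior_sd**2`, which is what produces the closed forms.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def conventional_power(effect, se, alpha=0.05):
    """Power of a two-sided z-test at one fixed true effect size."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = effect / se
    return STD_NORMAL.cdf(-z_crit - shift) + 1 - STD_NORMAL.cdf(z_crit - shift)


def average_power(prior_mean, prior_sd, se, alpha=0.05):
    """Conventional power averaged over a normal prior on the true effect.

    Uses the fact that the estimate is marginally
    N(prior_mean, se**2 + prior_sd**2) under the stated assumptions.
    """
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    marginal_sd = sqrt(se**2 + prior_sd**2)
    lower = STD_NORMAL.cdf((-z_crit * se - prior_mean) / marginal_sd)
    upper = 1 - STD_NORMAL.cdf((z_crit * se - prior_mean) / marginal_sd)
    return lower + upper


def average_power_numeric(prior_mean, prior_sd, se, alpha=0.05, n=20001):
    """Same quantity as a direct weighted average over a grid of effect sizes."""
    prior = NormalDist(prior_mean, prior_sd)
    lo = prior_mean - 10 * prior_sd
    step = 20 * prior_sd / (n - 1)
    total = 0.0
    for i in range(n):
        effect = lo + i * step
        # Weight conventional power by how likely this effect size is.
        total += conventional_power(effect, se, alpha) * prior.pdf(effect) * step
    return total


def bayesian_decision_power(prior_mean, prior_sd, se, threshold=0.95):
    """Prior-averaged probability of meeting an assumed launch rule:
    launch when P(true effect > 0 | data) >= threshold. Requires prior_sd > 0.
    """
    shrink = prior_sd**2 / (prior_sd**2 + se**2)  # weight on the estimate
    post_sd = sqrt(shrink) * se                   # posterior standard deviation
    z_thr = STD_NORMAL.inv_cdf(threshold)
    # Smallest estimate whose posterior mean clears z_thr * post_sd.
    cutoff = prior_mean + (z_thr * post_sd - prior_mean) / shrink
    marginal_sd = sqrt(se**2 + prior_sd**2)
    return 1 - STD_NORMAL.cdf((cutoff - prior_mean) / marginal_sd)


if __name__ == "__main__":
    # Illustrative inputs only: prior centred at zero, prior sd equal to se.
    print(average_power(0.0, 1.0, 1.0))
    print(average_power_numeric(0.0, 1.0, 1.0))
    print(bayesian_decision_power(0.0, 1.0, 1.0))
```

The grid-based version is included only as a cross-check on the closed form. When `prior_sd` is set to zero, `average_power` reduces to conventional power at `prior_mean`, which matches the description of traditional calculations as fixing a single effect size.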