Have you ever taken a look at the “probability of outperforming” metric in Google Analytics’ Content Experiments and wondered how it was calculated? Have you ever scratched your head because the numbers didn’t make sense to you? I certainly have. It’s hard to see experiment results like the ones depicted below and *not* wonder what’s going on underneath the hood.

In this post, we’ll highlight how Google’s Content Experiments work, why it’s a really smart idea, and why you might still want to do a little bit of the heavy lifting yourself…

## How Google’s experiments work

The core of GA content experiments is what’s known as the “multi-armed bandit problem,” a concept that describes any situation in which you want to conduct an experiment in such a way that you maximize your reward rather than doing pure exploratory analysis. This is precisely the situation we find ourselves in with web content experiments.

Suppose you started an experiment on your website and, three weeks in, your original and experiment variants have e-commerce conversion rates of 2% and 3%, respectively. Unfortunately, your site only makes a couple hundred sales a week, so your current p-value is hovering at .06. What do you do? Let it run another week to see if p drops below .05? I certainly hope not! If I were 94% sure that an experiment variant was going to lift my sales by 50%, I'd go with it!
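To make that scenario concrete, here's a quick sketch of the significance test you'd run on numbers like those. The session and conversion counts below are made up, chosen so that a ~2% vs. ~3% split lands near p = .06:

```python
from scipy.stats import chi2_contingency

# Hypothetical counts chosen to roughly match the scenario above:
# ~2% vs ~3% conversion with a p-value near .06. Not real data.
conversions = [34, 51]          # original, variant
sessions = [1700, 1700]

# Build a 2x2 table of (converted, did not convert) per variant.
table = [[c, n - c] for c, n in zip(conversions, sessions)]

# Without the continuity correction, this is equivalent to a
# two-proportion z-test.
chi2, p_value, dof, expected = chi2_contingency(table, correction=False)
print(round(p_value, 3))
```

With counts like these, the p-value lands just above the conventional .05 cutoff, which is exactly the awkward spot described above.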

Google implements this very logic in their content experiments, except in a dynamic way. As the experiment goes along, more and more traffic gets redirected toward the winning variant, in order to help maximize reward. However, the algorithm will continue to send at least a little bit of traffic to the under-performing variants, so that the experiment results will continue to approach statistical significance. It’s a great methodology that can simultaneously shorten the lifetime of tests and save lots of conversions.
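In Scott's paper, this allocation scheme is Thompson sampling (also called randomized probability matching). Here's a minimal sketch of the idea, assuming made-up running totals and a flat Beta(1, 1) prior rather than Google's actual internals:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical running totals for two variants: conversions and sessions.
conversions = [40, 60]
sessions = [2000, 2000]

def choose_variant():
    # Draw one plausible conversion rate per variant from its Beta
    # posterior and serve the variant with the highest draw. Stronger
    # variants win most draws, so they receive most of the traffic,
    # but weaker variants still get an occasional visitor.
    draws = [rng.beta(1 + c, 1 + n - c) for c, n in zip(conversions, sessions)]
    return int(np.argmax(draws))

# Allocate the next 1,000 visitors.
allocation = np.bincount([choose_variant() for _ in range(1000)], minlength=2)
print(allocation)  # heavily skewed toward variant 1, but variant 0 isn't starved
```

As the counts update after each visitor, the allocation shifts automatically, which is exactly the dynamic reallocation described above.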

As for how Google’s algorithm determines the probability of outperforming, it operates on Bayesian inference. In layman’s terms, Google starts off with an educated guess about the usual payoff of content experiment variants and, as the experiment runs, continues to update its educated guess with real information. Eventually, it starts to learn which variant has the bigger payoff, and it can start to calculate the probability that it’s better from there. (If you’re interested in the gritty mathematical details, you’ll find all the information you want in this journal article on multi-armed bandits by Steve Scott, Google’s senior economic analyst.)
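Google's actual model is richer than this, but the core Bayesian update can be sketched with Beta distributions and a quick Monte Carlo estimate of the "probability of outperforming." The counts and the flat Beta(1, 1) prior below are illustrative assumptions, not Google's real configuration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical running totals: variant A has 40 conversions in 2,000
# sessions; variant B has 60 in 2,000. Not real experiment data.
conv = [40, 60]
sessions = [2000, 2000]

# With a flat Beta(1, 1) prior, the posterior for each variant's
# conversion rate is Beta(1 + conversions, 1 + non-conversions).
post_a = rng.beta(1 + conv[0], 1 + sessions[0] - conv[0], 100_000)
post_b = rng.beta(1 + conv[1], 1 + sessions[1] - conv[1], 100_000)

# "Probability of outperforming" = share of posterior draws where B > A.
p_b_beats_a = (post_b > post_a).mean()
print(round(p_b_beats_a, 2))
```

Each new batch of data sharpens the posteriors, so this probability drifts toward 0 or 1 as the experiment accumulates evidence.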

Of course, Google’s methodology is fantastic… except when it’s not. There are two situations in which you might not want to let Google do all the heavy lifting. We’ll consider each in turn.

## Turn off the multi-armed bandit to evaluate multiple goals

The first reason to avoid Google’s multi-armed bandit approach is if you want to evaluate your experiment across multiple goals. Suppose you have an experiment intended to increase adds-to-cart. One of the variants wins handily. But you start to wonder – have I really increased sales, or were these people just adding more items to cart and then leaving anyway? If you used Google’s multi-armed bandit, you may never know. Once the algorithm figures out one of the variants has a higher add-to-cart rate, it’ll start sending all your traffic there. But that may leave you with too small a sample in some of your variants when you go to analyze the actual difference in conversion rate.

If you think you might want to evaluate multiple outcomes in your experiment, I’d suggest disabling the multi-armed bandit. You can do so by setting “distribute traffic evenly across all variations” to “on” in the advanced settings for your experiment.

## Double-check Google’s probabilities

The second reason you might want to do your own probability calculations is that, as I alluded to in the introduction, sometimes Google’s Bayesian algorithm generates some really wacky probabilities. For example, I recently worked with a site that ran a test of a green button vs. a blue button. Some basic conversion stats for this experiment looked like this:

Google was reporting that the blue button had a 65% chance of outperforming the green. However, a simple chi-squared test on the data came back with a p of .01. That’s a pretty wild discrepancy.

I got in touch with Steve Scott, Google’s senior economic analyst, about the results, and he confirmed that Google’s algorithm can be very sensitive to the prior distribution selected for the analysis. Again in layman’s terms, this means that the educated guess Google makes about the probabilities of success in the experiment sometimes doesn’t line up well with reality, which can slow the experiment down as it tries to determine a winner.
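Prior sensitivity is easy to demonstrate with a Beta-posterior Monte Carlo sketch. The counts below are hypothetical, but the pattern is general: a strongly informative prior pulls both variants toward the same rate and shrinks the reported probability of outperforming:

```python
import numpy as np

rng = np.random.default_rng(7)

def prob_b_beats_a(prior_a, prior_b, n_draws=100_000):
    # Hypothetical data: 40/2000 vs 62/2000 conversions (2.0% vs 3.1%).
    conv, sess = [40, 62], [2000, 2000]
    draws = [rng.beta(prior_a + c, prior_b + n - c, n_draws)
             for c, n in zip(conv, sess)]
    return (draws[1] > draws[0]).mean()

flat = prob_b_beats_a(1, 1)             # uninformative Beta(1, 1) prior
skeptical = prob_b_beats_a(400, 19600)  # strong prior centered on a 2% rate

print(round(flat, 2), round(skeptical, 2))
```

The data are identical in both calls; only the prior changed, yet the reported probability of outperforming moves substantially. A prior that doesn't match reality can produce exactly the kind of wacky number described above.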

The bottom line in all of this is that Google’s algorithm is going to do its best to drive your experiment towards a quick resolution, all while maximizing the benefit to you as a site owner. Unfortunately, it’s not going to be perfect. In my experience, double-checking the algorithm’s reported probabilities with some simple intuition and a vanilla statistical test or two can be a good way to make sure you’re not waiting longer than you should be to pick a winner for your test.

Matthew Deal (September 24, 2015 / 8:58 pm):

Great explanation. I’ve personally always used multi-armed bandit experiments, but now I’m second-guessing myself.

Georgi (June 13, 2017 / 4:12 am):

Hi Dayne,

Not sure if you saw this post by Chris Stucchio on why bandits are most likely not well-suited for A/B testing: https://www.chrisstucchio.com/blog/2015/dont_use_bandits.html Basically, you can’t apply them without breaking some of their fundamental assumptions.

The same is true for classical tests of statistical significance, where the sample size needs to be fixed in advance. With all the pressure you get from stakeholders to deliver results early, stop losers, and promote winners, that’s near-impossible to apply in practice, even with an informed audience.

If you are still looking for efficient ways to run A/B tests, I’d suggest giving AGILE A/B Testing a try: a new statistical approach to A/B testing that borrows from medical trial design.

daynebatten (June 22, 2017 / 7:59 am):

Super useful stuff. You’re absolutely right that the best approach is to calculate the statistical power you want ahead of time, figure out the sample size you need to get there, go get a sample that’s big enough, and then run your stats.

P-hacking and false positives are rampant in the world of academic publishing, where there’s peer review. I shudder to think how much more of it goes on in the world of web A/B testing…
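For illustration, the up-front sample-size calculation described in the comment above can be sketched with the standard two-proportion normal approximation. The baseline rate, lift, alpha, and power here are hypothetical planning inputs:

```python
from scipy.stats import norm

# Hypothetical planning inputs: baseline 2% conversion, and we want to
# detect a lift to 3% with alpha = .05 (two-sided) and 80% power.
p1, p2 = 0.02, 0.03
alpha, power = 0.05, 0.80

z_alpha = norm.ppf(1 - alpha / 2)   # critical value for the two-sided test
z_beta = norm.ppf(power)            # quantile corresponding to desired power

# Standard two-proportion sample-size approximation (per variant).
n = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2
print(round(n))
```

Run the numbers before the test starts, collect that many sessions per variant, and only then look at the p-value; that's the discipline that keeps false positives in check.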