How Long Does It Take to Get Accurate A/B Test Results?

August 7th, 2014 by Andy

It’s no surprise that A/B testing headlines is all the rage these days—A/B testing is a proven method of increasing conversions and teaches editors how to write better headlines. But because stories don’t live forever, it’s good to ask a few questions when trying to decide if headline testing is right for you.

We’d all love it if we could instantly know if one headline does better than another at increasing conversions. But the reality is that an A/B test, just like any other experiment, needs to run long enough in order to give us reliable results—sometimes longer than headlines are relevant.

In this blog post, we’ll look at what it means for an experiment to be accurate and then at what we can do to make A/B tests both quick and accurate.

Any reliable experiment needs to have these two properties:
  1. Statistical significance: if our headlines perform about the same, there is only a low probability that we mistake statistical noise for a real difference in conversions.
  2. Statistical power: if there actually is a big difference in headline conversions, there is a high probability that we detect it.

Suppose a headline has a 5% click-through rate (CTR) on a landing page that gets 694 visitors per hour. If this pageview rate is constant, this equates to about half a million monthly pageviews. The hypothesis we want to test is whether a new headline that we wrote performs at least 50% better or worse than our baseline—that is, if it has a CTR of less than 2.5% or greater than 7.5%.

We’ll design an experiment that has 5% statistical significance and 90% power. If the two headlines actually perform the same, we’ll report a false positive only 5% of the time, which means that 95% of the time we’ll have the right answer. If the new headline really does differ from the baseline by at least 50%, our experiment detects it 90% of the time.

In the scenario above, we would have to run our A/B test for about 5 hours before we get enough trials to end our experiment. [1]
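
For readers who want to see where a number like this comes from, here is a sketch using one common normal-approximation formula for comparing two proportions. It is not necessarily the exact method behind reference [1], and the symbols are our own labels: p is the baseline CTR, δ is the absolute CTR difference we want to detect, α is the significance level, 1 − β is the power, n is the number of visitors each headline needs, and v is the visitors per hour, which we assume are split evenly between the two headlines.

```latex
n \approx \frac{\left( z_{1-\alpha/2}\,\sqrt{2p(1-p)}
      + z_{1-\beta}\,\sqrt{p(1-p) + (p+\delta)(1-p-\delta)} \right)^{2}}{\delta^{2}},
\qquad
T = \frac{2n}{v}
```

Plugging in p = 0.05, δ = 0.025, z_{0.975} ≈ 1.960 and z_{0.90} ≈ 1.282 gives n ≈ 1,738 visitors per headline, so T ≈ 3,476 / 694 ≈ 5.0 hours.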

Three factors influence how quickly an experiment can achieve statistical significance and statistical power:

1. Landing page traffic. The more visitors viewing your landing page, the more quickly you have human subjects to experiment on. The time needed to complete an A/B test is inversely proportional to your traffic. For instance, if our landing page gets 1,388 visitors per hour (one million monthly pageviews) instead of 694, we reduce the amount of time needed to two hours and 30 minutes.

2. Base click-through rate. Headlines that appear prominently and above the fold tend to have better click-through rates, which means our estimates of those rates become precise with fewer visitors. Suppose we change our baseline CTR from 5% to 10%. Keeping our 1,388 visitors per hour, our experiment time decreases again to about one hour and 11 minutes.

3. Difference in headline performance. If two headlines perform similarly, it’ll take more trials to be sure that the small differences we’re observing aren’t just noise. Suppose that we think that our new headline is going to be either much better or much worse than our baseline. We modify our hypothesis to ask whether a new headline that we wrote performs at least 75% better or worse than our baseline. Keeping our 1,388 visitors per hour and our baseline CTR of 10%, we see that our experiment time drops by more than half again, to about 32 minutes.
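
The effect of the three factors above can be checked with a short calculation. This is a minimal sketch, not necessarily the exact method behind the numbers in this post: it assumes a standard normal-approximation sample-size formula for two proportions, a two-sided test, an even traffic split between two headlines, and constant traffic. The function names `visitors_per_variation` and `hours_to_finish` are made up for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist


def visitors_per_variation(base_ctr, relative_change, alpha=0.05, power=0.90):
    """Approximate visitors each headline needs (normal approximation)."""
    delta = base_ctr * relative_change             # absolute CTR difference to detect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance quantile
    z_power = NormalDist().inv_cdf(power)          # power quantile
    # Spread of the CTR difference if the headlines perform the same ...
    sd_same = sqrt(2 * base_ctr * (1 - base_ctr))
    # ... and if the new headline's CTR is base_ctr + delta.
    sd_diff = sqrt(base_ctr * (1 - base_ctr)
                   + (base_ctr + delta) * (1 - base_ctr - delta))
    return ceil(((z_alpha * sd_same + z_power * sd_diff) / delta) ** 2)


def hours_to_finish(visitors_per_hour, base_ctr, relative_change):
    """Hours until both headlines have enough visitors, assuming a 50/50 split."""
    total_visitors = 2 * visitors_per_variation(base_ctr, relative_change)
    return total_visitors / visitors_per_hour


# The four scenarios discussed in this post:
# (label, visitors per hour, baseline CTR, relative change to detect)
scenarios = [
    ("baseline", 694, 0.05, 0.50),
    ("double the traffic", 1388, 0.05, 0.50),
    ("double the baseline CTR", 1388, 0.10, 0.50),
    ("bigger difference", 1388, 0.10, 0.75),
]

for label, traffic, ctr, change in scenarios:
    minutes = hours_to_finish(traffic, ctr, change) * 60
    print(f"{label}: about {minutes:.0f} minutes")
```

Under these assumptions the results land close to the times quoted above: roughly 5 hours, 2.5 hours, 71 minutes, and 32 minutes. Doubling traffic halves the time exactly, while changing the baseline CTR or the size of the difference shifts the required sample size itself.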

What does this mean for me?

  1. If your landing page gets more than a million pageviews a month, you’ll be able to reliably A/B test your top headlines before your stories get old; you’ll most likely get results within a couple of hours. With more traffic, you’ll be able to test less prominent headlines on your page and reduce the amount of time needed for your A/B tests.
  2. If your site gets fewer than a million pageviews a month, there’s still hope! Because click-through rates and the differences between your headlines are also major factors in determining the speed of A/B tests, A/B testing might still work for you.
  3. On a typical landing page, your top performing headlines above the fold might have a 5-7% click-through rate. As you scroll down the page, headline CTRs tend to drop below 1% for the average link. Unless you have a massive amount of traffic (more than 10 million monthly pageviews) or have two headlines that are drastically different from each other, you’re probably going to wait more than a few hours before you get results on headlines below the fold.

References

1. http://www.evanmiller.org/ab-testing/sample-size.html

Note: An earlier version of the post listed incorrect A/B testing times. Those times corresponded to different levels of statistical significance and power than the ones described in this post. The times have been corrected to reflect a statistical significance of 5% and a statistical power of 90%.