Archive for August, 2014

What Does Viewability Mean for Advertisers?

August 29th, 2014 by Alexandra

This is the third post in a series about online advertising measurement and methodologies. Feel free to email me or post in the comments section about topics you’d like to see covered in this series. Curious about Chartbeat advertising tools? Learn more here.

In the last post, we took a look at what viewability means for publishers. Now, we’re turning the tables and breaking down what viewability means for advertisers.

So, how are digital advertisers responding to the MRC’s viewability certification?

Similar to publishers, digital advertisers have started turning to viewability vendors to audit campaigns and measure the viewability of their ads. This process gives brand marketers and agencies the opportunity to monitor whether their ads have the chance to be seen and to differentiate between quality/value of ad placements. In short, viewability offers advertisers a new level of transparency that will, in effect, set new standards to which media vendors will be held.

And with that new transparency come new terms and conditions. For example, we’re slowly starting to see agencies build viewability percentage goals or viewable impression guarantees as line items in their requests for proposals (RFPs). This means that prior to advertisers signing any contracts, media sellers will either agree to optimize an advertiser’s campaign toward a certain (goal) viewability percentage, or they will guarantee a mutually agreed upon viewability percentage. Sellers will be expected to deliver on said targets in order to receive payment in full.

On a larger scale, viewability is forcing marketers to rethink their approaches to media buying. For some, that may mean rolling out a campaign and then only paying for ads that were in view. For others it may mean doing a “controlled” buy beforehand to only serve ads that are in view. As we mentioned in the previous post, publishers and advertisers are still trying to find middle ground here. That said, larger companies like Google and Yahoo!, as well as a few smaller networks, are already allowing advertisers to pay only for viewable ads. Whether that will be the go-to model moving forward is still uncertain. We’ll have more on this topic later in our series.

How are advertisers using viewability to their advantage?

Advertisers’ ability to measure and monitor viewability will potentially have significant impact on their business as a whole. As we see it, there are a few game changers:
  1. Viewable brand lift will become a real-time metric of success. On the brand marketing side, advertisers will be able to start identifying which elements increase viewability and impact, and make changes while a campaign is still running. A recent Nielsen study found that brand lift improved by 31% when responses from non-viewable ads were filtered out. Beyond seeing which ads were viewable, advertisers will be able to gain a better sense of which viewable ads resonated with their audience(s) and take those learnings to optimize future campaigns.
  2. Improving viewability will improve campaign results for advertisers. It’s not surprising that eliminating non-viewable ads (those not able to be seen by a visitor) from the equation would be a win for advertisers.  Kellogg’s, for example, found that improving viewability by 40% resulted in a 75% increase in the sales lift of a digital campaign. Receiving real-time ad viewability data—that enabled Kellogg’s to stop buying ads that nobody was seeing—had a significant impact on the company’s ROI.

By eliminating non-viewable ads from the picture, advertisers will be able to allocate more, if not all, of their campaigns to premium content sites. Let’s break it down: As I mentioned in my last post, viewability will allow sites that offer quality content and user experiences to stand out from those that do not. If we link viewability to page quality, publishers seeking to increase viewability will likely improve—if they haven’t already—the quality of their sites by making adjustments to page layouts, ad quantity and type. As site quality increases, engagement will likely increase as well, making certain pages and audiences become more valuable to advertisers. Conversely, the bad-actors (pages cluttered with ads, link-bait content) will find their inventory becoming less and less valuable to the market. Advertisers will be able to compare the performance of their campaigns across publishers, and will ultimately choose to buy ads on sites that offer both quality content and a quality audience.

In short, more transparency will lead to more informed ad buys, and theoretically, less wasted ad spend. When demand-side platforms (DSPs) optimize for viewable ads, thus taking non-viewable ads out of the equation, ad dollars will deliver more ROI.

What challenges are advertisers facing during this transition?

First, there is the obstacle of an impression being viewable versus being actually viewed. If you’ll recall, the IAB defines a viewable impression as one that’s at least 50% visible in the viewable portion of a person’s browser window for at least one second. So, technically, viewability tracks if an ad has a chance to be seen by a user, not if it actually was. Yes, an ad may have been served, but that doesn’t mean it was seen. Think, for a second, how many times you’ve done a quick scan of a page and completely missed an ad.

Note: It’s important to remember that measuring viewability is no easy feat. And while viewability may have a number of blind spots, it’s better than anything we’ve had before. Short of something (pretty crazy) like eye-tracking, viewability does do a solid job of helping to call out the bad stuff, read: non-viewable ads, link-bait content and poor quality, ad-heavy user experiences.

Second, as with publishers, many of the challenges advertisers are facing are due to the binary nature of a viewable impression. Specifically, viewability on its own does not offer a comprehensive measurement of impact or brand lift. A simple “viewable or not” fails to fully measure the effectiveness of an impression, as it doesn’t offer a nuanced view of the extent to which an ad resonated with an audience. In other words, viewability does not account for the fact that additional time in view beyond one second improves response rates. We’ll take a deeper dive into Active Exposure Time—measuring the amount of time an actively engaged audience is exposed to a display ad—and how it impacts brand recall later in the series. If you’re dying to know more now, check out this post by our chief data scientist Josh Schwartz.

And finally, the discrepancies in measurements between viewability vendors will create a few obstacles for advertisers as well. Remember how we said publishers may not get paid for ads that one vendor says aren’t viewable, even when another gave the thumbs up? Well, the same goes for advertisers, except they may end up paying for ads that nobody ever sees. Sound familiar? This isn’t the first time advertisers are facing this problem. More on the nitty-gritty mechanics of viewability measurements to come. Stay tuned...


Clicks, Coins, and Eight Cups of Tea

August 28th, 2014 by Dan

The story goes like this:

Sometime around 1935, the eminent statistician Ronald A. Fisher met a lady. The lady in question had a seemingly outrageous claim: She could tell, simply by taste, whether milk or tea was added to a cup first. But Fisher was skeptical, and, being the eminent statistician that he was, he developed a method by which to test her claim. He would present her with a series of eight cups; in half of those, milk would be placed in the cup first, and in the other half, tea would be placed in the cup first. Fisher would then hand the cups to the lady in random order and ask her which liquid was added first. If she performed better than chance, he'd believe her claim.

Fisher’s description of this account was one of the first applications of hypothesis testing, perhaps the most widely used—and arguably one of the most important—statistical concepts of the modern era. That it was focused on a simple two-outcome choice is not surprising. Many processes we run into every day can be modeled in this way, as a coin flip. Will the subway show up on time? Will someone click the link to this article? Will the Detroit Tigers or the Kansas City Royals win this game? 1

These kinds of problems—those in which there are only two outcomes—are known in statistics as Bernoulli processes. The main parameter governing these phenomena is the probability that a single trial succeeds. In Fisher’s example, this is the probability that the lady correctly identifies whether milk or tea is added first. For web traffic, this is the probability of clicking a link. In many of these types of two-outcome problems, you want to know how likely it is that you’ll observe some number of successes in a given number of trials. For example, you may be interested in the probability that 50 people will click on a link if 100 people see it. If you make an assumption that each event (i.e., each click) is independent of the previous event, the probability that you see some number of successes can be described by the binomial distribution. With a firm understanding of Bernoulli processes and the binomial distribution, you are equipped to model a whole host of binary-outcome problems.
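
To make that concrete, here is a minimal sketch in Python (my own illustration, not code from the post) of the binomial calculation just described: the chance of seeing 50 clicks out of 100 views, assuming every view is an independent Bernoulli trial with the same click probability.

```python
# A minimal sketch of the binomial calculation described above; assumes SciPy
# is available, and the click probability p is made up for illustration.
from scipy.stats import binom

p = 0.5        # assumed probability that a single visitor clicks
n_views = 100  # number of independent trials (visitors who see the link)
n_clicks = 50  # number of successes we are asking about

exactly_50 = binom.pmf(n_clicks, n_views, p)            # exactly 50 clicks
at_least_50 = 1 - binom.cdf(n_clicks - 1, n_views, p)   # 50 or more clicks

print(f"P(exactly 50 clicks)  = {exactly_50:.4f}")
print(f"P(at least 50 clicks) = {at_least_50:.4f}")
```

The only free parameter is p, and estimating p is exactly what the rest of this post is about.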

Is this a fair coin?

A binomial distribution, however, isn’t incredibly useful if we don’t know the probability of success for a single trial. Honestly, this is what we’re typically interested in finding out, and it is really what Fisher tested: He assumed the probability of the lady guessing whether milk or tea was added first was pure chance (50/50), and developed a test to see whether the experimental results were consistent with that assumption. But, in general, how do we determine what this probability is?

There are two ways we can estimate the probability from a set of trials. We could simply count the number of successes we’ve had and divide by the total number of trials. For instance, if we flipped a coin 10 times and it came up heads 3 of those times, we might guess that the coin is an unfair coin, landing on its head only 30% of the time. This is all well and good, but we only flipped the coin 10 times. How certain are we that the probability is actually 0.3? Perhaps it truly is a fair coin and our sample size was just too small.

Alternatively, we could assume that our probability of interest itself has some distribution. That is, perhaps we think that the probability is about 0.3, but we concede that it could be 0.1 or 0.5 or even 0.99999. Treating our parameter as a distribution is the heart of a technique known as Bayesian inference, which is based upon Bayes rule:

Prob(quantity-of-interest | observation) = Prob(observation | quantity-of-interest) × Prob(quantity-of-interest) / Prob(observation)

Don’t be intimidated by this equation—it is actually fairly intuitive. The left-hand side represents the answer to the question: Given the data we observed, how certain are we that our quantity-of-interest takes on a given value? This is called the posterior distribution. The right-hand side contains information about what we believe about the process we’re interested in. Prob(quantity-of-interest) is known as the prior distribution. This describes our initial beliefs about the quantity we’re trying to find out about; in this case, our probability of success in the Bernoulli trial. Prob(observation | quantity-of-interest) is called the likelihood. The likelihood describes what we believe the distribution of the data to be if we assume our quantity is a specific value. In our click-through/coin-flipping example, this is simply the binomial distribution. If we know the fairness of the coin p, then the probability that we get M successes out of N flips follows a binomial distribution with parameters N and p. Then, a simple multiplication of our prior and our likelihood gives us our posterior. 2

The above equation may not seem very impressive, but the real power of the Bayesian method comes in when we iteratively apply the equation to update our beliefs. That is, we can use a previously calculated posterior as a prior in a next round of calculation to update our posterior. If we do this enough times, we hope to decrease our uncertainty enough so that we can confidently determine what our “true” probability is. The neat thing is that if we choose our prior intelligently, we can get the math to work out so that updates are fairly easy.

That’s the math, but here is a concrete example. Consider an example website. Suppose we’re interested in the probability that a person will click on some link. If 10 visitors come to the page, and three of those people click on the link, we might guess that the click-through probability for that link is 3/10 = 0.3, but we wouldn’t be very certain; we only flipped the coin a small number of times. The far left panel on the figure below shows a prior we might build based on that uncertainty. It is peaked near 0.3, but is quite wide. 3

[Figure: posterior distributions for the click-through probability as more visitors are observed, from 10 visitors up to 1,000; the peak near 0.3 narrows as data accumulate.]

Now suppose that we’ve waited long enough for many, many visitors. The two subsequent panels show how the distribution evolves as we gather more data. When we’ve seen 1000 visitors, we are pretty darn certain that the click-through probability is somewhere very close to 0.3. Now imagine what happens when we’ve seen 10,000 visitors!
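
To see how easy those updates can be, here is a toy sketch (mine, not code from the post) using a Beta prior, the kind of “intelligent” prior choice alluded to above: because the Beta distribution is conjugate to the binomial likelihood, each update simply adds the new successes and failures to the prior’s two parameters.

```python
# A toy sketch of Bayesian updating for a click-through probability, assuming
# a Beta prior (conjugate to the binomial likelihood). The numbers mirror the
# example above; this is not the code behind the post's figures.
from scipy import stats

def update(alpha, beta, clicks, views):
    """Posterior Beta parameters after observing `clicks` successes in `views` trials."""
    return alpha + clicks, beta + (views - clicks)

alpha, beta = 1.0, 1.0                                # flat Beta(1, 1) prior

# First 10 visitors: 3 clicks -> a wide posterior peaked near 0.3
alpha, beta = update(alpha, beta, clicks=3, views=10)

# 990 more visitors at roughly the same rate -> a much narrower posterior
alpha, beta = update(alpha, beta, clicks=297, views=990)

posterior = stats.beta(alpha, beta)
lo, hi = posterior.interval(0.95)                     # central 95% credible interval
print(f"posterior mean = {posterior.mean():.3f}, 95% interval = ({lo:.3f}, {hi:.3f})")
```

Printing the interval after the first update and again after the second shows the same narrowing as the figure: roughly (0.1, 0.6) after 10 visitors, and only a few hundredths wide after 1,000.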

The subtleties of assumption

Pretty amazing, right? If we gather data long enough, we can be incredibly certain about our click-through probability. In many cases, this is true. But let’s back up a bit.

In considering Bernoulli processes there is a fundamental underlying assumption that can often be overlooked. The assumption is this: The probability of success, p, is constant from trial to trial. For most phenomena, this is a reasonable assumption. But what if it is not? If the probability varies from trial to trial and this isn’t accounted for in our Bayesian updating, then we can end up becoming very certain about an incorrect probability. Consider the following example, where our probability varies smoothly between 0.3 and 0.6 over the course of 1,000 trials.

[Figure: the true probability of success varying smoothly between 0.3 and 0.6 over the course of 1,000 trials.]

What happens when we do Bayesian updating with the same assumptions as above?

[Figure: posteriors from Bayesian updating on the time-varying process; the peak moves around with the number of trials while the distribution keeps getting narrower.]

Not only does the peak of our posterior jump around wildly, depending on how many trials we do, but we start becoming incredibly certain that the probability is near the dead center of our varying probability function. I like to call this spurious certainty. We have an inaccurate model and too much data! We have become too certain in our beliefs.
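
Here is a rough simulation of that failure mode (my own sketch, not the code used to produce the figures above): the true click probability drifts between 0.3 and 0.6, but the updates assume a single fixed value, so the posterior ends up extremely narrow around the dead center of the varying probability.

```python
# A rough simulation of spurious certainty: the true probability varies from
# trial to trial, but the conjugate updates (wrongly) assume it is constant.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_trials = 1000

# True probability drifts smoothly between 0.3 and 0.6 over the 1,000 trials
t = np.arange(n_trials)
p_true = 0.45 + 0.15 * np.sin(2 * np.pi * t / n_trials)

# Bernoulli outcomes generated from the time-varying probability
outcomes = rng.random(n_trials) < p_true

# Naive Beta-Binomial updating that assumes a single fixed probability
alpha = 1.0 + outcomes.sum()
beta = 1.0 + n_trials - outcomes.sum()

posterior = stats.beta(alpha, beta)
lo, hi = posterior.interval(0.95)
print(f"posterior mean = {posterior.mean():.3f}, 95% interval = ({lo:.3f}, {hi:.3f})")
```

After 1,000 trials the 95% interval is only a few percentage points wide, even though the underlying probability spent most of its time well away from that peak.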

This may seem like a contrived case, but in actuality, it is not. In fact, we’ve seen data here at Chartbeat to suggest that the probability to click on a link is time dependent. Take the figure below, which shows the average click probability for all links on an anonymous site’s homepage on a particular day.

[Figure: average click probability for all links on an anonymous site’s homepage over the course of a single day.]

The probability drops by about 70% from the beginning of the day to around 2 p.m., and then climbs back up. In order to accurately depict the click-through behavior of this site's users, we have to take this variation into account to avoid spurious certainty.

Using the data as a guide

How exactly can we take into consideration this time variation? We could add time directly into our Bayesian updates, but to get good data we might have to wait a long time. After all, in the general case we don’t really know what this variation looks like. Does our probability vary by time of day? Day of week? Month? Year? All of these? In reality, we probably don’t have enough time to gather enough data for our Bayesian updating to be very informative.

An alternative way is to forget about doing any sort of modeling and simply use measurements. In this method, we forget about priors and posteriors and likelihoods and just make a histogram of the data we’ve measured. We could, in effect, build an empirical form of the distributions from the figures above. Then we can update our beliefs by averaging the histogram of old data with the histogram of new data; we can even use a weighted average so anomalies will get “smoothed out.” We may not get a nice thin distribution, but at least we capture some of this temporal variation and we avoid spurious certainty. In fact, we've built our Heads Up Display, which measures click-through probabilities, to do exactly this.
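
As a sketch of what that purely empirical approach could look like (an illustration under my own assumptions, not the Heads Up Display’s actual implementation), we can bucket measured click-through rates into a histogram and fold each new batch of measurements into the old histogram with a weighted average:

```python
# A sketch of the histogram-blending idea: keep an empirical distribution of
# measured click-through rates and fold new measurements in with a weighted
# average, so anomalies get smoothed out rather than dominating.
import numpy as np

BINS = np.linspace(0.0, 1.0, 51)   # 50 buckets for click probabilities
WEIGHT_NEW = 0.2                   # how strongly new data pulls the estimate

def to_histogram(ctr_samples):
    """Normalized histogram of observed click-through rates."""
    counts, _ = np.histogram(ctr_samples, bins=BINS)
    return counts / max(counts.sum(), 1)

def blend(old_hist, new_samples, weight_new=WEIGHT_NEW):
    """Weighted average of the old distribution and a histogram of new data."""
    return (1 - weight_new) * old_hist + weight_new * to_histogram(new_samples)

# Simulated morning data centered near 0.3, afternoon data drifting toward 0.5
morning = np.random.default_rng(1).normal(0.30, 0.05, size=500).clip(0, 1)
afternoon = np.random.default_rng(2).normal(0.50, 0.05, size=500).clip(0, 1)

hist = to_histogram(morning)
hist = blend(hist, afternoon)      # the estimate now carries weight in both regions
print("most likely bucket starts at", BINS[hist.argmax()])
```

The resulting distribution is wider and lumpier than the tidy posteriors above, but that width is honest: it reflects the temporal variation instead of hiding it.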

The Tao of Data

In my opinion, we—and by we I mean humanity—should be ever the skeptics. In data analysis, this is paramount. Like Fisher, we should question outrageous claims and find ways to test them. We should revisit old assumptions, test them, and revisit them again. The data will guide the way, but we should always beware of spurious certainty.

Or, you know, you could always just flip a coin.

If you’d like to talk about this in more detail, perhaps over a cup of tea, contact me at dan@chartbeat or find me on Twitter.


1. The Tigers. Always bet on the Tigers.

2. Ignoring, of course, the fraction’s denominator, but that is a bit out of the scope of this post... which is math speak for laziness.

3. A note about how to read probability density functions if you are not familiar with them: Probability density functions (PDFs) are truly probability densities; that is, the area under the curve between two values on the x-axis gives the probability that our quantity-of-interest will be between those two points. That’s why the y-axis is so funny. To get the probability, we essentially need to multiply the y-axis value by the distance between two values on the x-axis. If that doesn’t make any sense, just know that the best way to think about these distributions is to see where the curve is most dense—where the greatest area under the curve is. So, the places where the peaks exist are the most probable values. I’ll blog more about distributions in the near future.

On the Facebook News Feed Algorithm Changes and Why Short-Form Content Will Be Just Fine

August 25th, 2014 by Tony

Facebook announced today that it is making some changes to its News Feed algorithm to combat clickbait. Primarily, the social network will be looking at how much time people spend reading away from Facebook.

“If they click through to a link and then come straight back to Facebook, it suggests that they didn’t find something that they wanted. With this update we will start taking into account whether people tend to spend time away from Facebook after clicking a link, or whether they tend to come straight back to News Feed when we rank stories with links in them.”

Focusing on attention and time is nothing new for Facebook. On its last earnings call, Facebook specifically spoke about the size of its market opportunity in terms of the available time and attention it was able to accrue. On a more practical note, Facebook has been factoring how much time people spend away from Facebook after clicking on an ad into its pricing algorithm for some time now. In some ways, today's news is simply a wider application of that approach.

Second, the decision to enable greater previewing of links, effectively giving the visitor more information to decide whether the content is interesting to them, potentially confirms a theory that Chartbeat’s data science team has held. On average, traffic from Facebook spends about 60% more time reading than traffic from Twitter. While there are likely a number of factors at play, we think Facebook’s more sophisticated link previews are a clear differentiator that contributes to that gap.

Taken together, these two actions confirm that Facebook is taking its users’ experience incredibly seriously and is leaning more and more on the fundamental concepts of the Attention Web to do so. That’s good news for quality publishers everywhere.

But what does this mean for great short-form content? The one potential challenge to this was raised by Matt Galligan of the excellent news service Circa.

It’s utterly logical to be concerned that content designed for brevity would suffer under this algorithm. However, I think this underestimates the comparative wealth of attention that even content designed to be brief gets. The depressing truth of the Internet is that, when it comes to attention, short-form content hangs out on the same end of the distribution curve as long-form content.

As I’ve mentioned elsewhere, the majority of pageviews on the internet get fewer than 15 seconds of engagement. Facebook is looking for those instances when people come ‘straight back’ to the feed, suggesting that the threshold they’ve set for clickbait may be rather low. If your content matches the intent of your headline (i.e., you’re selling what you’re promising), then you’re highly likely to beat Facebook’s threshold even with short-form content.

Bottom line: Focus on creating quality content, match it with an accurate headline, and you’ll be fine.

Automatic Storytelling: Or, How to Build Your Very Own Data Scientist

August 19th, 2014 by Justin

As a data scientist, one of my primary jobs is to interpret large amounts of data and craft a story about what I discover. Not everyone is a data geek who wants to wade through large data sets, so it’s important to find a way to communicate insights from data that everyone can understand. Unfortunately, the process of analyzing data and compiling interesting results can be very time consuming. Even so, after telling these stories many times, some patterns emerge in the data analysis and communication of the findings. This led the data science team at Chartbeat to ask the question: Can we build an automated data scientist that can seek out interesting stories within our data?

Before we tackle “automated storytelling,” as we call it, let’s walk through the process I might go through when analyzing some data. Consider the small data set below, with pageviews, tweets, and average Engaged Time for one article on a completely hypothetical website.

               PAGEVIEWS   # OF TWEETS   AVG. ENGAGED TIME
WEBSITE.COM    10,000      200           38 seconds
ARTICLE        50          5             30 seconds

The first row contains information about all articles on website.com; the second row contains information about one page on the site. So, after looking at this data, I might come up with an insight like the following:

Even though “article” had below average engagement for “website.com,” readers shared this story 5 times more often than the typical story.

Let's break down where this insight came from. We see that “article” had five tweets, but without context, this does not tell us much. A great way to give context to a number is to compare it to a benchmark. For example, how does this number compare to the typical article on this website or the Internet as a whole? Put into that larger context, we can judge whether a number is good or not. In this case, we are given all we need to know about Twitter shares across the site, so let's compare Twitter activity on “article” to the average article on “website.com.” However, since the overall site has much more traffic than “article,” comparing the number of tweets for each would be unfair. When comparing numbers, it is important to compare apples to apples. The standard way to deal with this is to normalize your values. In this case, we consider the tweet rate for both. That is, the number of tweets per pageview:

Twitter share rate = number of tweets / pageviews

The table above then becomes:

               TWITTER SHARE RATE   AVG. ENGAGED TIME
WEBSITE.COM    1 / 50               38 seconds
ARTICLE        1 / 10               30 seconds

Now we are prepared to ask the following question: Was this page shared a lot? Or, how did the share rate of “article” compare to “website.com”? We answer:

“Article” was shared once per 10 pageviews, 5 times more than the typical article on “website.com."
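
In code, the normalization and the comparison are just a couple of divisions (a trivial sketch using the hypothetical numbers from the table):

```python
# The normalization above, with the hypothetical numbers from the table.
site_share_rate = 200 / 10_000   # website.com: 200 tweets / 10,000 pageviews = 1/50
article_share_rate = 5 / 50      # "article": 5 tweets / 50 pageviews = 1/10

ratio = article_share_rate / site_share_rate
print(f'"article" was shared {ratio:.0f}x as often as the typical story')  # 5x
```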

This gives us an interesting one dimensional fact about the article. To get a more complete picture of how this article performed, however, it would be better to consider multiple data points. In this case, we also have access to information about how much time users spent engaging with content, so we can use that as well.

We ask an additional question: Was this story engaging? Or, how did the average engagement of “article” compare to the typical story on “website.com”? We answer:

Readers of “article” spent an average of 30 seconds actively reading content, which is less than the typical story on “website.com.”

As we ask and answer additional questions about the data for “article,” we start to get a more complete picture of the success of the story. In fact, if we combine this information, we start to build a story about our data, and in this case we will end up with something similar to what we stated above.

In summary, we performed a two-step process where we answered two questions:

  1. Was “article” shared a lot on Twitter?
  2. Was “article” engaging?

Since both of these questions have two possible answers, yes or no, we have four total possible scenarios. This can be represented as a decision tree like the following:

[Figure: the two-question decision tree, with four leaves corresponding to the four yes/no scenarios.]

For “article” we answered YES to question 1 and NO to question 2. This corresponds to the following path in our decision tree:

[Figure: the decision tree with the path YES to question 1, NO to question 2 highlighted.]

Repeating this procedure with another story, we might end up in a different branch of the tree. For example, consider the new data set:

               TWITTER SHARE RATE   AVG. ENGAGED TIME
WEBSITE.COM    1 / 50               38 seconds
ARTICLE #1     1 / 10               30 seconds
ARTICLE #2     1 / 60               50 seconds

When we ask the same series of questions for “article #2”, we would follow this path:

[Figure: the decision tree with the path NO to question 1, YES to question 2 highlighted.]

And we could formulate a sentence like the following:

While “article #2” was shared less than the typical story, the content really captured its audience’s attention, with readers spending 32% more time engaged than the typical article.

In fact, we can create a different sentence for each of the four scenarios, so that no matter the situation we find ourselves in, we will have a sentence that describes the data in an interesting way. So, for a general article on “website.com” we could do the following (sketched in code after this list):

  1. Ask a series of questions which create a path through a decision tree.
  2. Grab a template sentence that corresponds to the final destination of the decision tree.
    • For example, let X be an article with a Twitter share rate Y times larger than the site as a whole.
    • Even though X had below average engagement, readers shared this story Y times more often than the typical story.
  3. Fill in the blanks:
    • Even though “article” had below average engagement for “website.com,” readers shared this story 5 times more often than the typical story.
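
The three steps above fit in a handful of lines. Here is a toy sketch (my own; any template wording beyond the post’s example sentence is made up) of the two-question, four-leaf tree:

```python
# A toy sketch of the two-question decision tree and its template sentences.
# Benchmarks come from the hypothetical "website.com" table above.
SITE_SHARE_RATE = 1 / 50
SITE_ENGAGED_TIME = 38  # seconds

TEMPLATES = {
    (True, True): ('"{name}" was a hit: readers shared it {share_ratio:.0f} times '
                   'more often than the typical story and spent {engaged}s reading it.'),
    (True, False): ('Even though "{name}" had below average engagement for '
                    '"website.com," readers shared this story {share_ratio:.0f} times '
                    'more often than the typical story.'),
    (False, True): ('While "{name}" was shared less than the typical story, readers '
                    'spent {engaged}s actively engaged, above the site average.'),
    (False, False): ('"{name}" was shared less and read less than the typical story '
                     'on "website.com."'),
}

def tell_story(name, tweets, pageviews, engaged_time):
    share_rate = tweets / pageviews
    shared_a_lot = share_rate > SITE_SHARE_RATE       # question 1
    engaging = engaged_time > SITE_ENGAGED_TIME       # question 2
    template = TEMPLATES[(shared_a_lot, engaging)]    # leaf of the decision tree
    return template.format(name=name, engaged=engaged_time,
                           share_ratio=share_rate / SITE_SHARE_RATE)

print(tell_story("article", tweets=5, pageviews=50, engaged_time=30))
```

Running it on “article” (5 tweets, 50 pageviews, 30 seconds of engagement) lands in the “shared a lot, not engaging” leaf and prints the same sentence we wrote by hand above.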

Now we have found a way to automatically generate a basic sentence about tweets and engagement on an article, but what more can we do to make this feel like real insights from a human data scientist?

Above we created one template sentence per branch of the decision tree.  A simple trick we can play to give this process a more human touch is to increase the variety in the sentences that are created. For example, we could take the sentence:

Even though “article” had below average engagement for “website.com,” readers shared this story 5 times more often than the typical story.

...and restate it as:

1 of every 10 people who read “article” tweeted about it, 500% more often than the average story on “website.com.”  On the other hand, these readers only spent 30 seconds engaging with this content on average.

Rather than writing one template sentence per branch of the decision tree, we can create a collection of templates. This serves to create an illusion of a real data scientist telling you these facts, and will prevent the results from getting stale. We can also use additional data to include related data points. For example, in the case when the story is active on Twitter, we could enhance our original insight in the following way:

Even though “article” had below average engagement, readers shared this story 5 times more often than the typical story. In fact, the tweet from “user” generated 20 pageviews and 100 total seconds of engagement for this story.  

Every time a question is asked in the decision tree, if there is additional data available, we can automatically add in extra information to flesh out the narrative.

While the example presented was very simple, there are endless possible data points to consider in general. We can extend the method described here to more complex scenarios where there are many more levels to the decision tree with multiple paths at each node.

This is the general framework (a bare-bones sketch in code follows the list):

  1. Create a decision tree where each node corresponds to a question about your data set.
    • Note: It is better to make sure that all of your questions are related in some way. The more related the questions are, the more coherent the final story will be.
  2. Create a list of template sentences that represent the story at the end of each branch of the decision tree.
  3. For a given data set, ask the series of questions to create a path through the tree.
  4. After answering each question, create supporting facts that will help flesh out the story.
  5. Randomly choose one of the template sentences corresponding to this branch of the tree and fill in the blanks.
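
Here is one way that framework could be skeletonized in code (an illustrative sketch only, with invented node names and template text): each interior node holds a question and an optional supporting-fact generator, and each leaf holds several template phrasings to choose from at random.

```python
# A bare-bones skeleton of the framework: walk a decision tree of questions,
# collect supporting facts along the way, then fill in a randomly chosen
# template. Node names and wording are invented for illustration.
import random

def walk(node, data, facts):
    """Follow the data down the tree, gathering supporting facts at each node."""
    if "templates" in node:                       # reached a leaf
        sentence = random.choice(node["templates"]).format(**data)
        return " ".join([sentence] + facts)
    answer = node["question"](data)               # True/False branch
    fact = node.get("extra_fact", lambda d, a: None)(data, answer)
    if fact:
        facts.append(fact)
    return walk(node["yes"] if answer else node["no"], data, facts)

# A tiny one-question tree with multiple phrasings per leaf to avoid stale output.
tree = {
    "question": lambda d: d["share_rate"] > d["site_share_rate"],
    "extra_fact": lambda d, a: (f'The top tweet drove {d["tweet_pageviews"]} pageviews.'
                                if a and "tweet_pageviews" in d else None),
    "yes": {"templates": [
        '"{name}" was shared far more often than the typical story.',
        'Readers could not stop sharing "{name}".',
    ]},
    "no": {"templates": ['"{name}" was shared less than the typical story.']},
}

data = {"name": "article", "share_rate": 1 / 10, "site_share_rate": 1 / 50,
        "tweet_pageviews": 20}
print(walk(tree, data, facts=[]))
```

Each call can phrase the same finding slightly differently, which is what keeps the automated “data scientist” from sounding stale.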

In fact, this methodology powers the automated research tool within Chartbeat’s Advertising Platform, which automatically creates insights and data narratives about the ad campaigns on your site. For example, this is a narrative about how an ad campaign performed on a site’s top content.

[Figure: an example narrative generated by the automated research tool.]

This was constructed by following the red path through this decision tree in a way that is very similar to the example we walked through above:

[Figure: the full decision tree, with the red path that produced the narrative above highlighted.]

So, what do you think? We'd love to hear about your applications of this methodology.

How Long Does It Take to Get Accurate A/B Test Results?

August 7th, 2014 by Andy

It’s no surprise that A/B testing headlines is all the rage these days—A/B testing is a proven method of increasing conversions and teaches editors how to write better headlines. But because stories don’t live forever, it’s good to ask a few questions when trying to decide if headline testing is right for you.

We’d all love it if we could instantly know if one headline does better than another at increasing conversions. But the reality is that an A/B test, just like any other experiment, needs to run long enough in order to give us reliable results—sometimes longer than headlines are relevant.

In this blog post, we’ll look at what it means for an experiment to be reliable, and then at what we can do to make A/B tests both quick and accurate.

Any reliable experiment needs to have these two properties:
  1. Statistical significance: if our headlines actually perform about the same, there should be only a low probability that we mistake statistical noise for a real difference in conversions.
  2. Statistical power: if there really is a big difference in headline conversions, there should be a high probability that we detect it.

Suppose a headline has a 5% click-through rate (CTR) on a landing page that gets 694 visitors per hour. If this pageview rate is constant, this equates to about half a million monthly pageviews. The hypothesis we want to test is whether a new headline that we wrote performs at least 50% better or worse than our baseline—that is, if it has a CTR of less than 2.5% or greater than 7.5%.

We’ll design an experiment that has 5% statistical significance and 90% power. If our hypothesis is false, we’ll give a false positive only 5% of the time, which means that 95% of the time we’ll have the right answer. If our hypothesis is true, our experiment gets things right 90% of the time.

In the scenario above, we would have to run our A/B test for about 5 hours before we get enough trials to end our experiment. [1]
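
For the curious, here is roughly where a number like that comes from. This is a sketch that uses the textbook normal-approximation sample-size formula for comparing two proportions rather than the calculator cited in the references, so it lands in the same ballpark (five to six hours) without matching the post’s figure exactly:

```python
# A rough sketch of the sample-size arithmetic, using the standard
# normal-approximation formula for comparing two proportions. Results are
# approximate and depend on the exact convention used.
from math import sqrt
from scipy.stats import norm

def hours_needed(visitors_per_hour, base_ctr, relative_change,
                 significance=0.05, power=0.90):
    p1 = base_ctr
    p2 = base_ctr * (1 + relative_change)        # the effect we want to detect
    delta = abs(p2 - p1)
    p_bar = (p1 + p2) / 2
    z_alpha = norm.ppf(1 - significance / 2)     # two-sided 5% significance -> 1.96
    z_beta = norm.ppf(power)                     # 90% power -> 1.28
    n_per_variant = ((z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                      + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
                     / delta ** 2)
    return 2 * n_per_variant / visitors_per_hour  # traffic split between two headlines

print(f"{hours_needed(694, 0.05, 0.50):.1f} hours")  # the scenario described above
```

Rerunning the function with more traffic, a higher baseline CTR, or a larger expected difference shrinks the answer, which is the pattern behind the three factors below.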

Three factors influence how quickly an experiment can achieve statistical significance and statistical power:

1. Landing page traffic. The more visitors viewing your landing page, the faster you accumulate subjects for your experiment. The time needed to complete an A/B test is inversely proportional to your traffic. For instance, if our landing page gets 1,388 visitors per hour (one million monthly pageviews) instead of 694, we reduce the amount of time needed to two hours and 30 minutes.

2. Base click-through rate. Headlines that appear prominently above the fold tend to have higher click-through rates, which means we converge to precise estimates more quickly. Suppose we change our baseline CTR from 5% to 10%. Keeping our 1,388 visitors per hour, our experiment time decreases again to about one hour and 11 minutes.

3. Difference in headline performance. If two headlines perform similarly, it’ll take more trials to be sure that the small differences we’re observing aren’t just noise. Suppose that we think that our new headline is going to be either much better or much worse than our baseline. We modify our hypothesis to ask whether a new headline that we wrote performs at least 75% better or worse than our baseline. Keeping our 1,388 visitors per hour and our baseline CTR of 10%, we see that our experiment time decreases by half yet again to 32 minutes.

What does this mean for me?

  1. If your landing page gets more than a million pageviews a month, you’ll be able to reliably A/B test your top headlines before your stories get old - you’ll most likely get results within a couple of hours. With more traffic, you’ll be able to test less prominent headlines on your page and reduce the amount of time needed for your A/B tests.
  2. If your site gets less than a million pageviews a month, there’s still hope! Because click-through rates and the differences between your headlines are also major factors in determining the speed of A/B tests, A/B testing might still work for you.
  3. On a typical landing page, your top performing headlines above the fold might have a 5-7% click-through rate. As you scroll down the page, headline CTRs tend to drop below 1% for the average link. Unless you have a massive amount of traffic (more than 10 million monthly pageviews) or have two headlines that are drastically different from each other, you’re probably going to wait more than a few hours before you get results on headlines below the fold.

References

1. http://www.evanmiller.org/ab-testing/sample-size.html

Note: An earlier version of the post listed incorrect A/B testing times. Those times corresponded to different levels of statistical significance and power than the ones described in this post. The times have been corrected to reflect a statistical significance of 5% and a statistical power of 90%.