Archive for the ‘Data Science’ Category

Economics of Ad Refreshing

November 5th, 2014 by Justin

Editor's Note: This article originally appeared in the fall 2014 issue of the Chartbeat Quarterly, our once-a-season data science magazine.

When a television program goes to commercial break, we see a series of 30-second spots, rather than one continuous advert. That three minutes of commercial time generates more collective value to advertisers when it’s split up than if it were given to a single advertiser. So what happens if we apply the same principle to ads on the Internet?

Our research suggests that the longer an ad is in view, the greater the likelihood that a person will recall the brand behind the advertisement. However, according to multiple studies, after a short period of time, the effect of time on brand recall is greatly diminished (Figure 1).

[Figure 1: Brand recall as a function of active exposure time]

This means that ads with higher active exposure time have higher value to advertisers, but only to a point. So why not exploit this fact by “refreshing” an ad after a fixed amount of time?

Ad refreshing is not a new idea, but it is unpopular because ads refreshed traditionally—after a certain amount of wall-clock time has passed—are unlikely to be seen. A series of non-viewable ads has no value to advertisers. On the other hand, if we refresh ads once they’ve been in view for a set amount of time, we can ensure both that an ad was seen for a fair amount of time before changing it over to a new one and that the new ad will be viewed.

This is an exciting idea because refreshing ads generates a large number of new viewable impressions. Traditionally, if a user is reading a page for two minutes with an ad in view, this person will only be exposed to one ad in a given position. If we refresh each ad after it’s viewed for 30 seconds, however, each single impression becomes four, generating three additional impressions, each of which is viewed. Table 1 shows the impact of different ad refresh times on viewable impressions and average exposure times across the Chartbeat network.

[Table 1: Viewable impressions and average exposure times across the Chartbeat network for different ad refresh intervals]

From Table 1, we see that the number of viewable impressions on a typical site can be increased by as much as 93% if a 10-second ad refresh is used. This has the effect of almost doubling the inventory of viewable impressions on a site. On the other hand, we also see that this reduces the time that people spend with individual ads on average, because we are limiting the amount of time people can spend with an individual ad. This means that each refreshed impression has slightly less value to an advertiser than before. Because of this, we can probably expect that advertisers would require a discount to compensate for the loss of time.
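
As a rough sketch of where these per-visit numbers come from (the 120-second visit and 30-second refresh interval below are illustrative, not Chartbeat data):

    import math

    def viewable_impressions(seconds_in_view, refresh_seconds):
        # The initial impression plus one new impression per completed refresh
        # interval: a 120-second view with a 30-second refresh yields 4 impressions.
        return max(1, math.ceil(seconds_in_view / refresh_seconds))

    def average_exposure(seconds_in_view, refresh_seconds):
        # The visit's active exposure time is split across the refreshed impressions.
        return seconds_in_view / viewable_impressions(seconds_in_view, refresh_seconds)

    print(viewable_impressions(120, 30))  # -> 4
    print(average_exposure(120, 30))      # -> 30.0

Aggregating counts like these over every visit in the network, for each candidate refresh interval, produces the kind of comparison shown in Table 1.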

So, is ad refreshing worth it? Does the value of an increased inventory of viewable impressions offset the loss in value to each refreshed impression? To answer this question, we will investigate the economic ramifications of ad refreshing.

Our goal is to compare the value of the ad inventory on a typical site with and without ad refreshing.

According to research at Yahoo, the closer an ad is to the start of a session, the more likely a user is to recall the brand represented in that ad. This means that when refreshing ads, the ads shown first have more value than the ads shown later. In fact, the researchers suggest that showing more than two ads in a single session is unlikely to be effective. Therefore, for our comparison we will only analyze single ad refreshes within an ad unit and we will make the following assumptions:

  1. The value of ad exposures to an advertiser can be quantified by recognition and recall.
  2. This value to advertisers correlates directly to revenue for the publisher.
  3. The value of first and second ad impressions is represented in Figure 2, which relates exposure time to recall and recognition.

[Figure 2: Recall and recognition as a function of exposure time for first and second ad impressions]

We use these assumptions to calculate a baseline value of the ad inventory for a typical site without ad refreshing and compare this to the value of the ad inventory using different ad refresh times.
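
To make the comparison concrete, here is a heavily simplified sketch of the calculation. The saturating recall curve and the discount on a second impression are stand-ins for the relationships in Figure 2, not the values we actually used:

    import math

    # Stand-in for Figure 2: recall saturates with active exposure time.
    # The curve shape and the 0.6 discount on second impressions are assumptions
    # made for this sketch only.
    def impression_value(seconds, second_impression=False):
        value = 1.0 - math.exp(-seconds / 15.0)
        return 0.6 * value if second_impression else value

    def visit_value(seconds_in_view, refresh_seconds=None):
        # Baseline: one ad in view for the whole visit.
        if refresh_seconds is None or seconds_in_view <= refresh_seconds:
            return impression_value(seconds_in_view)
        # Single refresh: the first ad gets the refresh interval, the second the rest.
        return (impression_value(refresh_seconds)
                + impression_value(seconds_in_view - refresh_seconds, second_impression=True))

    # A two-minute visit, with and without a 30-second refresh.
    print(visit_value(120))      # value of the single, unrefreshed impression
    print(visit_value(120, 30))  # combined value of the two refreshed impressions

Summing visit values like these over a site's traffic, with and without refreshing, gives the inventory-level comparison shown in Table 2.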

As we can see in Table 2, ad refreshing does result in an increase in ad revenue. This means that the increase in viewable impression inventory outweighs the loss in value to refreshed impressions thanks to the diminishing returns in recall shown in Figure 2.

[Table 2: Estimated ad revenue for a typical site under different ad refresh times]

Researchers agree that refreshing ads this way should increase a site’s revenue, and I think this calculation bears that out. For example, with a 10-second ad refresh, the typical site gains 93% more viewable impression inventory and a 12% increase in revenue. Even with our relatively conservative calculation, which only allowed for a single ad refresh, we see a healthy increase in revenue. For this reason, it seems likely that ad refreshing will be a significant source of new revenue for online publishers.


 


Clicks, Coins, and Eight Cups of Tea

August 28th, 2014 by Dan

The story goes like this:

Sometime around 1935, the eminent statistician Ronald A. Fisher met a lady. The lady in question had a seemingly outrageous claim: She could tell, simply by taste, whether milk or tea was added to a cup first. But Fisher was skeptical, and, being the eminent statistician that he was, he developed a method by which to test her claim. He would present her with a series of eight cups; in half of those, milk would be placed in the cup first, and in the other half, tea would be placed in the cup first. Fisher would then hand the cups to the lady in random order and ask her which liquid was added first. If she performed better than chance, he'd believe her claim.

Fisher’s description of this account was one of the first applications of hypothesis testing, perhaps the most widely used—and arguably one of the most important—statistical concepts of the modern era. That it was focused on a simple two-outcome choice is not surprising. Many processes we run into every day can be modeled in this way, as a coin flip. Will the subway show up on time? Will someone click the link to this article? Will the Detroit Tigers or the Kansas City Royals win this game? 1

These kinds of problems—those in which there are only two outcomes—are known in statistics as Bernoulli processes. The main parameter governing these phenomena is the probability that a trial succeeds. In Fisher’s example, this is the probability that the lady correctly identifies whether milk or tea is added first. For web traffic, this is the probability of clicking a link. In many of these types of two-outcome problems, you want to know how likely it is that you’ll observe some number of successes in a given number of trials. For example, you may be interested in the probability that 50 people will click on a link if 100 people see it. If you assume that each event (i.e., each click) is independent of the previous event, the probability that you see some number of successes can be described by the binomial distribution. With a firm understanding of Bernoulli processes and the binomial distribution, you are equipped to model a whole host of binary-outcome problems.
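
For instance, the 50-clicks-out-of-100 question is a one-liner once a click probability is assumed (the p = 0.4 below is purely illustrative):

    from scipy.stats import binom

    # Probability of exactly 50 clicks out of 100 views, assuming independent
    # Bernoulli trials with an (illustrative) click probability of 0.4.
    p_exactly_50 = binom.pmf(50, n=100, p=0.4)

    # Probability of at least 50 clicks out of 100 views.
    p_at_least_50 = binom.sf(49, n=100, p=0.4)

    print(p_exactly_50, p_at_least_50)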

Is this a fair coin?

A binomial distribution, however, isn’t incredibly useful if we don’t know the probability of success for a single trial. Honestly, this is what we’re typically interested in finding out, and it is really what Fisher tested: He assumed that the lady’s guesses about whether milk or tea was added first were pure chance (50/50), and developed a test to see whether the results of the experiment were consistent with that assumption. But, in general, how do we determine what this probability is?

There are two ways we can estimate the probability from a set of trials. We could simply count the number of successes we’ve had and divide by the total number of trials. For instance, if we flipped a coin 10 times and it came up heads 3 of those times, we might guess that the coin is unfair, landing on heads only 30% of the time. This is all well and good, but we only flipped the coin 10 times. How certain are we that the probability is actually 0.3? Perhaps it truly is a fair coin and our sample size was just too small.

Alternatively, we could assume that our probability of interest itself has some distribution. That is, perhaps we think that the probability is about 0.3, but we concede that it could be 0.1 or 0.5 or even 0.99999. Treating our parameter as a distribution is the heart of a technique known as Bayesian inference, which is based upon Bayes’ rule:

Prob(quantity-of-interest | observation) = Prob(observation | quantity-of-interest) × Prob(quantity-of-interest) / Prob(observation)

Don’t be intimidated by this equation—it is actually fairly intuitive. The left-hand side represents the answer to the question: Given the data we observed, how certain are we that our quantity-of-interest takes on a given value? This is called the posterior distribution. The right-hand side contains information about what we believe about the process we’re interested in. Prob(quantity-of-interest) is known as the prior distribution. This describes our initial beliefs about the quantity we’re trying to find out about; in this case, our probability of success in the Bernoulli trial. Prob(observation | quantity-of-interest) is called the likelihood. The likelihood describes what we believe the distribution of the data to be if we assume our quantity is a specific value. In our click-through/coin-flipping example, this is simply the binomial distribution. If we know the fairness of the coin p, then the probability that we get M successes out of N flips is given by the binomial distribution with parameters N and p. Then, a simple multiplication of our prior and our likelihood gives us our posterior. 2

The above equation may not seem very impressive, but the real power of the Bayesian method comes in when we iteratively apply the equation to update our beliefs. That is, we can use a previously calculated posterior as a prior in a next round of calculation to update our posterior. If we do this enough times, we hope to decrease our uncertainty enough so that we can confidently determine what our “true” probability is. The neat thing is that if we choose our prior intelligently, we can get the math to work out so that updates are fairly easy.
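
For a Bernoulli process, the “intelligent” choice is a Beta prior, which is conjugate to the binomial likelihood: each update just adds the new successes and failures to the Beta’s two parameters. Here is a minimal sketch (the starting Beta(3, 7) is only meant to illustrate a prior that is peaked near 0.3 but still wide):

    from scipy.stats import beta

    # Prior on the click-through probability: peaked near 0.3, but wide.
    a, b = 3, 7

    def update(a, b, clicks, views):
        # With a binomial likelihood, the posterior is Beta(a + clicks, b + non-clicks).
        return a + clicks, b + (views - clicks)

    # 1,000 more visitors arrive, roughly 300 of whom click.
    a, b = update(a, b, clicks=300, views=1000)

    posterior = beta(a, b)
    print(posterior.mean())          # still close to 0.3
    print(posterior.interval(0.95))  # but a much narrower credible interval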

That’s the math, but here is a concrete example. Suppose we’re interested in the probability that a person will click on some link on a website. If 10 visitors come to the page, and three of those people click on the link, we might guess that the click-through probability for that link is 3/10 = 0.3, but we wouldn’t be very certain; we have only flipped the coin a small number of times. The far left panel of the figure below shows a prior we might build based on that uncertainty. It is peaked near 0.3, but is quite wide. 3

[Figure: Posterior distribution of the click-through probability as more visitors arrive, from 10 up to 1,000]

Now suppose that we’ve waited long enough for many, many visitors. The two subsequent panels show how the distribution evolves as we gather more data. When we’ve seen 1000 visitors, we are pretty darn certain that the click-through probability is somewhere very close to 0.3. Now imagine what happens when we’ve seen 10,000 visitors!

The subtleties of assumption

Pretty amazing, right? If we gather data long enough, we can be incredibly certain about our click-through probability. In many cases, this is true. But let’s back up a bit.

In considering Bernoulli processes there is a fundamental underlying assumption that can often be overlooked. The assumption is this: The probability of success, p, is constant from trial to trial. For most phenomena, this is a reasonable assumption. But what if it is not? If the probability varies from trial to trial and this isn’t accounted for in our Bayesian updating, then we can end up becoming very certain about an incorrect probability. Consider the following example, where our probability varies smoothly between 0.3 and 0.6 over the course of 1,000 trials.

[Figure: A click probability that varies smoothly between 0.3 and 0.6 over the course of 1,000 trials]

What happens when we do Bayesian updating with the same assumptions as above?

[Figure: Posterior distributions from naive Bayesian updating when the underlying probability varies over time]

Not only does the peak of our posterior jump around wildly, depending on how many trials we do, but we start becoming incredibly certain that the probability is near the dead center of our varying probability function. I like to call this spurious certainty. We have an inaccurate model and too much data! We have become too certain in our beliefs.
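
You can reproduce this effect in a few lines of simulation (a sketch; the sinusoidal drift between 0.3 and 0.6 mirrors the figure above but is otherwise arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    n_trials = 1000

    # The true click probability drifts between 0.3 and 0.6 over the trials.
    t = np.arange(n_trials)
    p_true = 0.45 + 0.15 * np.sin(2 * np.pi * t / n_trials)

    # Naive Beta-Binomial updating that wrongly assumes p is constant.
    clicks = rng.random(n_trials) < p_true
    a, b = 1 + clicks.sum(), 1 + (~clicks).sum()

    mean = a / (a + b)
    std = np.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    print(mean, std)  # sharply peaked near 0.45: spurious certainty about a
                      # probability that was never actually constant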

This may seem like a contrived case, but in actuality, it is not. In fact, we’ve seen data here at Chartbeat suggesting that the probability of clicking on a link is time dependent. Take the figure below, which shows the average click probability for all links on an anonymous site’s homepage on a particular day.

[Figure: Average click probability for all links on an anonymous site’s homepage over the course of one day]

The probability drops by roughly 70% from the beginning of the day to around 2 p.m., and then climbs back up. In order to accurately depict the click-through behavior of this site's users, we have to take this variation into account to avoid spurious certainty.

Using the data as a guide

How exactly can we take into consideration this time variation? We could add time directly into our Bayesian updates, but to get good data we might have to wait a long time. After all, in the general case we don’t really know what this variation looks like. Does our probability vary by time of day? Day of week? Month? Year? All of these? In reality, we probably don’t have enough time to gather enough data for our Bayesian updating to be very informative.

An alternative way is to forget about doing any sort of modeling and simply use measurements. In this method, we forget about priors and posteriors and likelihoods and just make a histogram of the data we’ve measured. We could, in effect, build an empirical form of the distributions from the figures above. Then we can update our beliefs by averaging the histogram of old data with the histogram of new data; we can even use a weighted average so anomalies will get “smoothed out.” We may not get a nice thin distribution, but at least we capture some of this temporal variation and we avoid spurious certainty. In fact, we've built our Heads Up Display, which measures click-through probabilities, to do exactly this.
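
Here is a sketch of that empirical approach, assuming we bin measured click-through rates into a histogram for each time window (the 0.8 weight on old data is an arbitrary choice for illustration):

    import numpy as np

    bins = np.linspace(0.0, 1.0, 51)  # bins over possible click-through probabilities

    def histogram(click_rates):
        # Empirical distribution of measured click-through rates for one time window.
        counts, _ = np.histogram(click_rates, bins=bins, density=True)
        return counts

    def update(old_hist, new_hist, old_weight=0.8):
        # Weighted average of old and new histograms, so anomalies get smoothed out.
        return old_weight * old_hist + (1 - old_weight) * new_hist

    # Illustrative measurements from two consecutive windows for the same link.
    belief = histogram(np.array([0.28, 0.31, 0.30, 0.33, 0.29]))
    belief = update(belief, histogram(np.array([0.45, 0.50, 0.48, 0.52])))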

The Tao of Data

In my opinion, we—and by we I mean humanity—should be ever the skeptics. In data analysis, this is paramount. Like Fisher, we should question outrageous claims and find ways to test them. We should revisit old assumptions, test them, and revisit them again. The data will guide the way, but we should always beware of spurious certainty.

Or, you know, you could always just flip a coin.

If you’d like to talk about this in more detail, perhaps over a cup of tea, contact me at dan@chartbeat or find me on Twitter.


1. The Tigers. Always bet on the Tigers.

2. Ignoring, of course, the fraction’s denominator, but that is a bit out of the scope of this post... which is math speak for laziness.

3. A note about how to read probability density functions if you are not familiar with them: Probability density functions (PDFs) are truly probability densities; that is, the area under the curve between two values on the x-axis gives the probability that our quantity-of-interest will fall between those two points. That’s why the y-axis is so funny. To get the probability, we essentially need to multiply the y-axis value by the distance between two values on the x-axis. If that doesn’t make any sense, just know that the best way to think about these distributions is to see where the curve is most dense—where the greatest area under the curve is. So, the places where the peaks exist are the most probable values. I’ll blog more about distributions in the near future.

Automatic Storytelling: Or, How to Build Your Very Own Data Scientist

August 19th, 2014 by Justin

As a data scientist, one of my primary jobs is to interpret large amounts of data and craft a story about what I discover. Not everyone is a data geek who wants to wade through large data sets, so it’s important to find a way to communicate insights from data that everyone can understand. Unfortunately, the process of analyzing data and compiling interesting results can be very time consuming. Even so, after telling these stories many times, some patterns emerge in the data analysis and communication of the findings. This led the data science team at Chartbeat to ask: Can we build an automated data scientist that can seek out interesting stories within our data? Before we tackle "automated storytelling," as we call it, let’s walk through the process I might go through when analyzing some data. Consider the small data set below, which covers pageviews, tweets, and average Engaged Time for one article on a completely hypothetical website.

                   PAGEVIEWS    # OF TWEETS    AVG. ENGAGED TIME
    WEBSITE.COM    10,000       200            38 Seconds
    ARTICLE        50           5              30 Seconds

The first row contains information about all articles on website.com; the second row contains information about one page on the site. So, after looking at this data, I might come up with an insight like the following:

Even though “article” had below-average engagement for “website.com,” readers shared this story 5 times more often than the typical story.

Let's break down where this insight came from. We see that “article” had five tweets, but without context, this does not tell us much. A great way to give context to a number is to compare it to a benchmark. For example, how does this number compare to the typical article on this website or the Internet as a whole? Put into the context of a larger picture, we can judge whether a number is good or not. In this case, we are given all we need to know about Twitter shares across the site, so let's compare Twitter activity on “article” to the average article on “website.com.” However, since the overall site has much more traffic than “article,” comparing the raw number of tweets for each would be unfair. When comparing numbers, it is important to compare apples to apples. The standard way to deal with this is to normalize your values. In this case, we consider the tweet rate for both—that is, the number of tweets per pageview:

Twitter share rate = number of tweets / pageviews

The table above then becomes:

                   TWITTER SHARE RATE    AVG. ENGAGED TIME
    WEBSITE.COM    1 / 50                38 Seconds
    ARTICLE        1 / 10                30 Seconds

Now we are prepared to ask the following question: Was this page shared a lot? Or, how did the share rate of “article” compare to “website.com”? We answer:

“Article” was shared once per 10 pageviews, 5 times more than the typical article on “website.com."

This gives us an interesting one-dimensional fact about the article. To get a more complete picture of how this article performed, however, it would be better to consider multiple data points. In this case, we also have access to information about how much time users spent engaging with content, so we can use that as well.

We ask an additional question: Was this story engaging? Or, how did the average engagement of “article” compare to the typical story on “website.com”? We answer:

Readers of “article” spent an average of 30 seconds actively reading content, which is less than the typical story on “website.com.”

As we ask and answer additional questions about the data for “article,” we start to get a more complete picture of the success of the story. In fact, if we combine this information, we start to build a story about our data, and in this case we will end up with something similar to what we stated above.

In summary, we performed a two-step process where we answered two questions:

  1. Was “article” shared a lot on Twitter?
  2. Was “article” engaging?

Since both of these questions have two possible answers, yes or no, we have four total possible scenarios. This can be represented as a decision tree like the following:

[Figure: A decision tree with two yes/no questions and four possible outcomes]

For “article” we answered YES to question 1 and NO to question 2. This corresponds to the following path in our decision tree:

[Figure: The path through the decision tree for “article”: high sharing, below-average engagement]

Repeating this procedure with another story, we might end up in a different branch of the tree. For example, consider the new data set:

                   TWITTER SHARE RATE    AVG. ENGAGED TIME
    WEBSITE.COM    1 / 50                38 Seconds
    ARTICLE #1     1 / 10                30 Seconds
    ARTICLE #2     1 / 60                50 Seconds

When we ask the same series of questions for “article #2”, we would follow this path:

[Figure: The path through the decision tree for “article #2”: below-average sharing, high engagement]

And we could formulate a sentence like the following:

While “article #2” was shared less than the typical story, the content really captured its audience’s attention, with readers spending 32% more time engaged than the typical article.

In fact, we can create a different sentence for each of the four scenarios, so that no matter the situation we find ourselves in, we will have a sentence which describes the data in an interesting way. So, for a general article on “website.com” we could do the following:

  1. Ask a series of questions which create a path through a decision tree.
  2. Grab a template sentence that corresponds to the final destination of the decision tree.
    • For example, let X be an article with a Twitter share rate Y times larger than the site average.
    • Even though X had below average engagement, readers shared this story Y times more often than the typical story.
  3. Fill in the blanks:
    • Even though “article” had below average engagement for “website.com,” readers shared this story 5 times more often than the typical story.

Now we have found a way to automatically generate a basic sentence about tweets and engagement on an article, but what more can we do to make this feel like real insights from a human data scientist?

Above we created one template sentence per branch of the decision tree.  A simple trick we can play to give this process a more human touch is to increase the variety in the sentences that are created. For example, we could take the sentence:

Even though “article” had below average engagement for “website.com,” readers shared this story 5 times more often than the typical story.

...and restate it as:

1 of every 10 people who read “article” tweeted about it, 500% more often than the average story on “website.com.”  On the other hand, these readers only spent 30 seconds engaging with this content on average.

Rather than writing one template sentence per branch of the decision tree, we can create a collection of templates. This serves to create an illusion of a real data scientist telling you these facts, and will prevent the results from getting stale. We can also use additional data to include related data points. For example, in the case when the story is active on Twitter, we could enhance our original insight in the following way:

Even though “article” had below average engagement, readers shared this story 5 times more often than the typical story. In fact, the tweet from “user” generated 20 pageviews and 100 total seconds of engagement for this story.  

Every time a question is asked in the decision tree, if there is additional data available, we can automatically add in extra information to flesh out the narrative.

While the example presented was very simple, there are endless possible data points to consider in general. We can extend the method described here to more complex scenarios where there are many more levels to the decision tree with multiple paths at each node.

This is the general framework:

  1. Create a decision tree where each node corresponds to a question about your data set.
    • Note: It is better to make sure that all of your questions are related in some way. The more related the questions are, the more coherent the final story will be.
  2. Create a list of template sentences that represent the story at the end of each branch of the decision tree.
  3. For a given data set, ask the series of questions to create a path through the tree.
  4. After answering each question, create supporting facts that will help flesh out the story.
  5. Randomly choose one of the template sentences corresponding to this branch of the tree and fill in the blanks.
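
Here is a minimal sketch of that framework in code. The questions, thresholds, and template sentences are invented for illustration; a real system has many more levels, branches, and templates per branch:

    import random

    # One list of template sentences per branch of the (two-question) decision tree.
    TEMPLATES = {
        ("shared", "engaging"): [
            "“{name}” was a hit: readers shared it {share_x:.0f} times more often than "
            "the typical story and spent {engaged} seconds actively reading it.",
        ],
        ("shared", "not engaging"): [
            "Even though “{name}” had below-average engagement, readers shared this "
            "story {share_x:.0f} times more often than the typical story.",
        ],
        ("not shared", "engaging"): [
            "While “{name}” was shared less than the typical story, readers spent "
            "{engaged} seconds engaged with it, well above the site average.",
        ],
        ("not shared", "not engaging"): [
            "“{name}” struggled to find an audience, with below-average sharing and "
            "engagement.",
        ],
    }

    def tell_story(article, site):
        # Walk the decision tree: each question picks a branch.
        share_x = (article["tweets"] / article["pageviews"]) / (site["tweets"] / site["pageviews"])
        shared = "shared" if share_x > 1 else "not shared"
        engaging = "engaging" if article["engaged"] > site["engaged"] else "not engaging"
        # Randomly choose a template for this branch and fill in the blanks.
        template = random.choice(TEMPLATES[(shared, engaging)])
        return template.format(name=article["name"], share_x=share_x, engaged=article["engaged"])

    site = {"pageviews": 10000, "tweets": 200, "engaged": 38}
    article = {"name": "article", "pageviews": 50, "tweets": 5, "engaged": 30}
    print(tell_story(article, site))
    # -> Even though “article” had below-average engagement, readers shared this
    #    story 5 times more often than the typical story.
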
In fact, this methodology powers the automated research tool within Chartbeat’s Advertising Platform, which automatically creates insights and data narratives about the ad campaigns on your site. For example, this is a narrative about how an ad campaign performed on a site’s top content.

[Figure: An automatically generated narrative from the automated research tool]

This was constructed by following the red path through this decision tree, in a way very similar to the example we walked through above:

[Figure: The decision tree used by the automated research tool, with the path followed highlighted in red]

So, what do you think? We'd love to hear about your applications of this methodology.

How Long Does It Take to Get Accurate A/B Test Results?

August 7th, 2014 by Andy

It’s no surprise that A/B testing headlines is all the rage these days—A/B testing is a proven method of increasing conversions and teaches editors how to write better headlines. But because stories don’t live forever, it’s good to ask a few questions when trying to decide if headline testing is right for you.

We’d all love it if we could instantly know if one headline does better than another at increasing conversions. But the reality is that an A/B test, just like any other experiment, needs to run long enough in order to give us reliable results—sometimes longer than headlines are relevant.

In this blog post, we’ll look at what it means for an experiment to be accurate and then look at what we can do to make A/B tests quick and accurate.

Any reliable experiment needs to have these two properties:
  1. Statistical significance: if our headlines actually perform about the same, there is only a low probability that we mistake statistical noise for a difference in conversions.
  2. Statistical power: if there really is a big difference in headline conversions, there is a high probability that we detect it.

Suppose a headline has a 5% click-through rate (CTR) on a landing page that gets 694 visitors per hour. If this pageview rate is constant, this equates to about half a million monthly pageviews. The hypothesis we want to test is whether a new headline that we wrote performs at least 50% better or worse than our baseline—that is, if it has a CTR of less than 2.5% or greater than 7.5%.

We’ll design an experiment that has 5% statistical significance and 90% power. If our hypothesis is false, we’ll give a false positive only 5% of the time, which means that 95% of the time we’ll have the right answer. If our hypothesis is true, our experiment gets things right 90% of the time.

In the scenario above, we would have to run our A/B test for about 5 hours before we get enough trials to end our experiment. [1]
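
The arithmetic behind that figure is a standard two-proportion sample-size calculation, the same one behind the calculator in reference [1]. A sketch:

    from math import ceil, sqrt
    from scipy.stats import norm

    def visitors_per_variant(p_base, min_detectable_diff, alpha=0.05, power=0.90):
        # Sample size for a two-sided test comparing two proportions.
        z_alpha = norm.ppf(1 - alpha / 2)
        z_power = norm.ppf(power)
        p_alt = p_base + min_detectable_diff
        n = (z_alpha * sqrt(2 * p_base * (1 - p_base))
             + z_power * sqrt(p_base * (1 - p_base) + p_alt * (1 - p_alt))) ** 2 / min_detectable_diff ** 2
        return ceil(n)

    # Baseline CTR of 5%, detect a 50% relative change (2.5 percentage points),
    # 5% significance, 90% power, 694 visitors per hour split across two headlines.
    n = visitors_per_variant(0.05, 0.025)
    print(n, round(2 * n / 694, 1))  # roughly 1,700 visitors per headline, about 5 hours

The scenarios below follow by changing the traffic rate, the baseline CTR, and the minimum detectable difference.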

Three factors influence how quickly an experiment can achieve statistical significance and statistical power:

1. Landing page traffic. The more visitors your landing page gets, the faster you accumulate subjects to experiment on. The time needed to complete an A/B test is inversely proportional to your traffic. For instance, if our landing page gets 1,388 visitors per hour (one million monthly pageviews) instead of 694, we reduce the amount of time needed to two hours and 30 minutes.

2. Base click-through rate. Headlines that appear prominently and above the fold tend to have higher click-through rates, which means we converge more quickly to precise estimates of those rates. Suppose we change our baseline CTR from 5% to 10%. Keeping our 1,388 visitors per hour, our experiment time decreases again to about one hour and 11 minutes.

3. Difference in headline performance. If two headlines perform similarly, it’ll take more trials to be sure that the small differences we’re observing aren’t just noise. Suppose that we think that our new headline is going to be either much better or much worse than our baseline. We modify our hypothesis to ask whether a new headline that we wrote performs at least 75% better or worse than our baseline. Keeping our 1,388 visitors per hour and our baseline CTR of 10%, we see that our experiment time decreases by half yet again to 32 minutes.

What does this mean for me?

  1. If your landing page gets more than a million pageviews a month, you’ll be able to reliably A/B test your top headlines before your stories get old; you’ll most likely get results within a couple of hours. With more traffic, you’ll be able to test less prominent headlines on your page and reduce the amount of time needed for your A/B tests.
  2. If your site gets less than a million pageviews a month, there’s still hope! Because click-through rates and the differences between your headlines are also major factors in determining the speed of A/B tests, A/B testing might still work for you.
  3. On a typical landing page, your top-performing headlines above the fold might have a 5-7% click-through rate. As you scroll down the page, headline CTRs tend to drop below 1% for the average link. Unless you have a massive amount of traffic (more than 10 million monthly pageviews) or have two headlines that are drastically different from each other, you’re probably going to wait more than a few hours before you get results on headlines below the fold.

References

1. http://www.evanmiller.org/ab-testing/sample-size.html

Note: An earlier version of the post listed incorrect A/B testing times. Those times corresponded to different levels of statistical significance and power than the ones described in this post. The times have been corrected to reflect a statistical significance of 5% and a statistical power of 90%.

Traffic During the Facebook Outage

August 4th, 2014 by Josh

As you’ve almost certainly heard, Facebook had a major outage midday on Friday. Overall traffic on news sites dropped by 3%, thousands took to Twitter to voice their frustration, and, apparently, a select few called the LA Sheriff's Department. Most interestingly for us, the Facebook outage provided a natural experiment: a look at what the world of web traffic looks like without Facebook. Here, I’ll delve into two issues that are particularly interesting to examine through the lens of the outage.

Facebook and dark social

So-called “dark social” traffic — traffic to articles that lacks a referrer because it comes via HTTPS or apps — is subject to endless speculation. What portion of it comes from emailed links? From links sent via instant messaging? From standard social sources like Facebook and Twitter but with the referrer obscured? From search sites that use HTTPS? By virtue of the fact that no explicit referrer is sent, it’s impossible to tell for sure. Since Facebook makes up a huge portion of non-dark traffic, one might guess that a big chunk of dark traffic is actually Facebook traffic in disguise.

Of course, during the outage virtually all Facebook traffic was stopped, so we can use that data to ask how much dark traffic was definitely not coming from Facebook. The answer? Very little of it was coming from Facebook directly. Take a look at the graph below.

[Figure: Facebook referral traffic and dark social traffic during the outage]

Facebook referrals dropped by almost 70% during the outage (note that traffic didn’t drop to zero, presumably because some number of people had Facebook pages open before the outage). There’s certainly a drop in dark social, but it's not nearly as stark: dark social traffic just before the outage was only 11% higher than at its low point during the outage. Since 70% of Facebook traffic dropped off, that would imply that at most 16% (11% / 70%) of dark social traffic could’ve been directly attributable to Facebook.

Now, of course, we’d expect some other social sharing might be negatively impacted — if people aren’t discovering articles on Facebook, they might not be sharing them in other ways. So, that doesn’t mean that 16% of dark social traffic is from Facebook, but it does provide strong evidence that 84% of dark social traffic is something other than Facebook traffic in disguise.

Where people go in an outage

As I discussed in my last post, a huge percentage of mobile traffic comes from Facebook. Given that, we’d probably expect mobile traffic to be hardest hit during the outage. And, indeed, entrances to sites on mobile devices were down 8.5%, when comparing the minute before the outage to the lowest point while Facebook was down.

Interestingly, though, we see the opposite effect on desktops: a 3.5% overall increase in desktop traffic after the beginning of the outage. That increase was largely fueled by a 9% increase in homepage direct traffic on sites with loyal homepage followings. We saw no increases in traffic via other referrers, including Twitter and Google News, during the outage. While we certainly can't claim that the outage was the cause of that uptick in desktop traffic, the timing is certainly notable.

[Figure: Desktop homepage direct traffic during the outage]

In short, then: our brief world without Facebook looked a bit different, albeit in predictable ways. Significantly less news was consumed on phones, slightly more homepages were visited on desktops, and 30 minutes later, when Facebook came back online, traffic returned to normal.