Archive for the ‘Data Science’ Category

Automatic Storytelling: Or, How to Build Your Very Own Data Scientist

August 19th, 2014 by Justin

As a data scientist, one of my primary jobs is to interpret large amounts of data and craft a story about what I discover. Not everyone is a data geek who wants to wade through large data sets, so it’s important to find a way to communicate insights from data that everyone can understand. Unfortunately, the process of analyzing data and compiling interesting results can be very time consuming. Even so, after telling these stories many times, patterns emerge in both the data analysis and the communication of the findings. This led the data science team at Chartbeat to ask: Can we build an automated data scientist that can seek out interesting stories within our data?

Before we tackle “automated storytelling,” as we call it, let’s walk through the process I might go through when analyzing some data. Consider this small data set of pageviews, tweets, and average Engaged Time for one article on a completely hypothetical website.

|             | PAGEVIEWS | # OF TWEETS | AVG. ENGAGED TIME |
|-------------|-----------|-------------|-------------------|
| WEBSITE.COM | 10,000    | 200         | 38 Seconds        |
| ARTICLE     | 50        | 5           | 30 Seconds        |

The first row contains information about all articles on website.com; the second row contains information about one page on the site. So, after looking at this data, I might come up with an insight like the following:

Even though “article” had below average engagement for “website.com,” readers shared this story 5 times more often than the typical story.

Let's break down where this insight came from. We see that “article” had five tweets, but without context, that number does not tell us much. A great way to give context to a number is to compare it to a benchmark: for example, how does it compare to the typical article on this website, or on the Internet as a whole? Put into that larger context, we can judge whether a number is good or not. In this case, we are given all we need to know about Twitter shares across the site, so let's compare Twitter activity on “article” to the average article on “website.com.” However, since the overall site has much more traffic than “article,” comparing the raw number of tweets for each would be unfair. When comparing numbers, it is important to compare apples to apples. The standard way to deal with this is to normalize your values. In this case, we consider the tweet rate for both, that is, the number of tweets per pageview:

Twitter share rate = number of tweets / pageviews
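As a minimal illustration, here is that normalization in code, using the hypothetical numbers from the table above (the variable names are ours, chosen just for this sketch):

```python
# Hypothetical numbers from the table above.
site = {"pageviews": 10_000, "tweets": 200}
article = {"pageviews": 50, "tweets": 5}

def twitter_share_rate(page):
    """Tweets per pageview, so pages with very different traffic can be compared."""
    return page["tweets"] / page["pageviews"]

print(twitter_share_rate(site))     # 0.02 -> 1 tweet per 50 pageviews
print(twitter_share_rate(article))  # 0.1  -> 1 tweet per 10 pageviews
```

Comparing the two rates (0.1 versus 0.02) is exactly what justifies the “5 times” claim in the insight above.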

The table above then becomes:

|             | TWITTER SHARE RATE | AVG. ENGAGED TIME |
|-------------|--------------------|-------------------|
| WEBSITE.COM | 1 / 50             | 38 Seconds        |
| ARTICLE     | 1 / 10             | 30 Seconds        |

Now we are prepared to ask the following question: Was this page shared a lot? Or, how did the share rate of “article” compare to “website.com”? We answer:

“Article” was shared once per 10 pageviews, 5 times more than the typical article on “website.com.”

This gives us an interesting one-dimensional fact about the article. To get a more complete picture of how this article performed, however, it would be better to consider multiple data points. In this case, we also have access to information about how much time users spent engaging with content, so we can use that as well.

We ask an additional question: Was this story engaging? Or, how did the average engagement of “article” compare to the typical story on “website.com”? We answer:

Readers of “article” spent an average of 30 seconds actively reading content, which is less than the typical story on “website.com.”

As we ask and answer additional questions about the data for “article,” we start to get a more complete picture of the success of the story. In fact, if we combine this information, we start to build a story about our data, and in this case we will end up with something similar to what we stated above.

In summary, we performed a two-step process where we answered two questions:

  1. Was “article” shared a lot on Twitter?
  2. Was “article” engaging?

Since both of these questions have two possible answers, yes or no, we have four total possible scenarios. This can be represented as a decision tree like the following:

[Image: storytelling-01 (the decision tree with four possible outcomes)]

For “article” we answered YES to question 1 and NO to question 2. This corresponds to the following path in our decision tree:

[Image: storytelling-02 (the YES/NO path for “article”)]

Repeating this procedure with another story, we might end up in a different branch of the tree. For example, consider the new data set:

|             | TWITTER SHARE RATE | AVG. ENGAGED TIME |
|-------------|--------------------|-------------------|
| WEBSITE.COM | 1 / 50             | 38 Seconds        |
| ARTICLE #1  | 1 / 10             | 30 Seconds        |
| ARTICLE #2  | 1 / 60             | 50 Seconds        |

When we ask the same series of questions for “article #2”, we would follow this path:

[Image: storytelling-03 (the path for “article #2”)]

And we could formulate a sentence like the following:

While “article #2” was shared less than the typical story, the content really captured its audience’s attention, with readers spending 32% more time engaged than the typical article.

In fact, we can create a different sentence for each of the four scenarios, so that no matter which situation we find ourselves in, we will have a sentence that describes the data in an interesting way. So, for a general article on “website.com” we could do the following (a code sketch of these steps follows the list):

  1. Ask a series of questions which create a path through a decision tree.
  2. Grab a template sentence that corresponds to the final destination of the decision tree.
    • For example, let X be an article with a Twitter share rate Y times larger than the site average.
    • Even though X had below average engagement, readers shared this story Y times more often than the typical story.
  3. Fill in the blanks:
    • Even though “article” had below average engagement for “website.com,” readers shared this story 5 times more often than the typical story.
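Here is a minimal sketch of those three steps in Python for the two-question example. The thresholds and the template wording come from this post; the data structures and names are invented for illustration only:

```python
# Hypothetical data: site-wide benchmarks and one article.
site    = {"share_rate": 1 / 50, "engaged_time": 38}
article = {"name": "article", "share_rate": 1 / 10, "engaged_time": 30}

# Step 1: ask the questions; the answers define a path through the decision tree.
shared_a_lot = article["share_rate"] > site["share_rate"]
engaging     = article["engaged_time"] > site["engaged_time"]

# Step 2: one template sentence per leaf, keyed by the answers to (question 1, question 2).
templates = {
    (True, False): ('Even though "{name}" had below average engagement for '
                    '"website.com," readers shared this story {ratio:.0f} times '
                    'more often than the typical story.'),
    # ... one template for each of the other three leaves
}

# Step 3: fill in the blanks.
ratio = article["share_rate"] / site["share_rate"]
print(templates[(shared_a_lot, engaging)].format(name=article["name"], ratio=ratio))
```

With the numbers above, this prints essentially the sentence we wrote by hand earlier.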

Now we have found a way to automatically generate a basic sentence about tweets and engagement on an article, but what more can we do to make these feel like real insights from a human data scientist?

Above we created one template sentence per branch of the decision tree.  A simple trick we can play to give this process a more human touch is to increase the variety in the sentences that are created. For example, we could take the sentence:

Even though “article” had below average engagement for “website.com,” readers shared this story 5 times more often than the typical story.

...and restate it as:

1 of every 10 people who read “article” tweeted about it, five times as often as the average story on “website.com.” On the other hand, these readers only spent 30 seconds engaging with this content on average.

Rather than writing one template sentence per branch of the decision tree, we can create a collection of templates for each branch. This creates the illusion of a real data scientist telling you these facts and prevents the results from getting stale. We can also pull in related data points when they are available. For example, when the story is active on Twitter, we could enhance our original insight in the following way:

Even though “article” had below average engagement, readers shared this story 5 times more often than the typical story. In fact, the tweet from “user” generated 20 pageviews and 100 total seconds of engagement for this story.  

Every time a question is asked in the decision tree, if there is additional data available, we can automatically add in extra information to flesh out the narrative.

While the example presented was very simple, there are endless possible data points to consider in general. We can extend the method described here to more complex scenarios where there are many more levels to the decision tree with multiple paths at each node.

This is the general framework (a minimal code sketch follows the list):

  1. Create a decision tree where each node corresponds to a question about your data set.
    • Note: It is better to make sure that all of your questions are related in some way. The more related the questions are, the more coherent the final story will be.
  2. Create a list of template sentences that represent the story at the end of each branch of the decision tree.
  3. For a given data set, ask the series of questions to create a path through the tree.
  4. After answering each question, create supporting facts that will help flesh out the story.
  5. Randomly choose one of the template sentences corresponding to this branch of the tree and fill in the blanks.
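One minimal way this framework might be sketched in Python is shown below; the class and function names are invented for illustration and are not Chartbeat’s actual implementation:

```python
import random
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Node:
    """One node of the decision tree: a question, children per answer, and leaf templates."""
    question: Optional[Callable] = None            # data -> answer (e.g., True / False)
    children: dict = field(default_factory=dict)   # answer -> child Node
    templates: list = field(default_factory=list)  # at a leaf: candidate template sentences
    supporting: Optional[Callable] = None          # optional extra fact for this step

def tell_story(root: Node, data: dict) -> str:
    """Steps 3-5: walk the tree, collect supporting facts, then fill in a random template."""
    facts, node = [], root
    while node.children:
        if node.supporting:
            facts.append(node.supporting(data))
        node = node.children[node.question(data)]
    sentence = random.choice(node.templates).format(**data)
    return " ".join([sentence] + facts)
```

The random choice among a leaf’s templates is what keeps repeated stories from feeling canned, and the per-node supporting facts are where the extra data points get woven into the narrative.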
In fact, this methodology powers the automated research tool within Chartbeat’s Advertising Platform, which automatically creates insights and data narratives about the ad campaigns on your site. For example, this is a narrative about how an ad campaign performed on a site’s top content.

[Image: automated-research (an automatically generated campaign narrative)]

This was constructed by following the red path through this decision tree in a way that is very similar to the example we walked through above:

[Image: storytelling-05 (the decision tree with the red path highlighted)]

So, what do you think? We'd love to hear about your applications of this methodology.

How Long Does It Take to Get Accurate A/B Test Results?

August 7th, 2014 by Andy

It’s no surprise that A/B testing headlines is all the rage these days—A/B testing is a proven method of increasing conversions and teaches editors how to write better headlines. But because stories don’t live forever, it’s good to ask a few questions when trying to decide if headline testing is right for you.

We’d all love it if we could instantly know if one headline does better than another at increasing conversions. But the reality is that an A/B test, just like any other experiment, needs to run long enough in order to give us reliable results—sometimes longer than headlines are relevant.

In this blog post, we’ll look at what it means for an experiment to be accurate and then look at what we can do to make A/B tests quick and accurate.

Any reliable experiment needs to have these two properties:
  1. Statistical significance: if our headlines actually perform about the same, there should be only a low probability that we mistake statistical noise for a difference in conversions.
  2. Statistical power: if there really is a big difference in headline conversions, there should be a high probability that we detect it.

Suppose a headline has a 5% click-through rate (CTR) on a landing page that gets 694 visitors per hour. If this pageview rate is constant, this equates to about half a million monthly pageviews. The hypothesis we want to test is whether a new headline that we wrote performs at least 50% better or worse than our baseline—that is, if it has a CTR of less than 2.5% or greater than 7.5%.

We’ll design an experiment that has 5% statistical significance and 90% power. If our hypothesis is false, we’ll give a false positive only 5% of the time, which means that 95% of the time we’ll have the right answer. If our hypothesis is true, our experiment gets things right 90% of the time.

In the scenario above, we would have to run our A/B test for about 5 hours before we get enough trials to end our experiment. [1]
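For the curious, the sketch below reproduces these numbers with the standard normal-approximation sample-size formula for comparing two proportions, which is the same kind of calculation the calculator in [1] performs; treat it as an approximation rather than a definitive tool:

```python
from math import sqrt
from statistics import NormalDist

def hours_to_run(base_ctr, relative_change, visitors_per_hour, alpha=0.05, power=0.90):
    """Approximate hours needed to detect a relative change in CTR in an A/B test."""
    p = base_ctr
    delta = base_ctr * relative_change              # minimum detectable absolute difference
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance threshold
    z_beta = NormalDist().inv_cdf(power)            # power threshold
    n_per_variant = ((z_alpha * sqrt(2 * p * (1 - p))
                      + z_beta * sqrt(p * (1 - p) + (p + delta) * (1 - p - delta)))
                     / delta) ** 2
    return n_per_variant / (visitors_per_hour / 2)  # traffic is split between two headlines

print(hours_to_run(0.05, 0.50, 694))    # ~5 hours: the scenario above
print(hours_to_run(0.05, 0.50, 1388))   # ~2.5 hours: double the traffic
print(hours_to_run(0.10, 0.50, 1388))   # ~1.2 hours: higher base CTR
print(hours_to_run(0.10, 0.75, 1388))   # ~0.5 hours: bigger difference between headlines
```

The last three calls preview the three factors discussed below.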

Three factors influence how quickly an experiment can achieve statistical significance and statistical power:

1. Landing page traffic. The more visitors viewing your landing page, the more quickly you have human subjects to experiment on. The time needed to complete an A/B test is inversely proportional to your traffic. For instance, if our landing page gets 1,388 visitors per hour (one million monthly pageviews) instead of 694, we reduce the amount of time needed to two hours and 30 minutes.

2. Base click-through rate. Headlines that appear prominently and above the fold tend to have better click-through rates, which means we quickly converge to precise estimates of those rates. Suppose we change our baseline CTR from 5% to 10%. Keeping our 1,388 visitors per hour, our experiment time decreases again to about one hour and 11 minutes.

3. Difference in headline performance. If two headlines perform similarly, it’ll take more trials to be sure that the small differences we’re observing aren’t just noise. Suppose that we think that our new headline is going to be either much better or much worse than our baseline. We modify our hypothesis to ask whether a new headline that we wrote performs at least 75% better or worse than our baseline. Keeping our 1,388 visitors per hour and our baseline CTR of 10%, we see that our experiment time decreases by half yet again to 32 minutes.

What does this mean for me?

  1. If your landing page gets more than a million pageviews a month, you’ll be able to reliably A/B test your top headlines before your stories get old: you’ll most likely get results within a couple of hours. With more traffic, you’ll be able to test less prominent headlines on your page and reduce the amount of time needed for your A/B tests.
  2. If your site gets less than a million pageviews a month, there’s still hope! Because click-through rates and the differences between your headlines are also major factors in determining the speed of A/B tests, A/B testing might still work for you.
  3. On a typical landing page, your top performing headlines above the fold might have a 5-7% click-through rate. As you scroll down the page, headline CTRs tend to drop below 1% for the average link. Unless you have a massive amount of traffic (more than 10 million monthly pageviews) or have two headlines that are drastically different from each other, you’re probably going to wait more than a few hours before you get results on headlines below the fold.

References

1. http://www.evanmiller.org/ab-testing/sample-size.html

Note: An earlier version of the post listed incorrect A/B testing times. Those times corresponded to different levels of statistical significance and power than the ones described in this post. The times have been corrected to reflect a statistical significance of 5% and a statistical power of 90%.

Traffic During the Facebook Outage

August 4th, 2014 by Josh

As you’ve all but certainly heard, Facebook had a major outage midday on Friday. Overall traffic on news sites dropped by 3%, thousands took to Twitter to voice their frustration, and, apparently, a select few called the LA Sheriff's Department. Most interestingly for us, the Facebook outage provided a natural experiment to look at what the world of web traffic looks like without Facebook. Here, I’ll delve into two issues that are particularly interesting to look at through the lens of the outage.

Facebook and dark social

So-called “dark social” traffic — traffic to articles that lacks a referrer because it comes via HTTPS or apps — is subject to endless speculation. What portion of it comes from emailed links? From links sent via instant messaging? From standard social sources like Facebook and Twitter but with the referrer obscured? From search sites that use HTTPS? By virtue of the fact that no explicit referrer is sent, it’s impossible to tell for sure. Since Facebook makes up a huge portion of non-dark traffic, one might guess that a big chunk of dark traffic is actually Facebook traffic in disguise.

Of course, during the outage virtually all Facebook traffic was stopped, so we can use that data to ask how much dark traffic was definitely not coming from Facebook. The answer? Very little of it was coming from Facebook directly. Take a look at the graph below.

[Image: traffic-fb-ds (Facebook referrals and dark social traffic during the outage)]

Facebook referrals dropped by almost 70% during the outage (note that traffic didn’t drop to 0, presumably because some number of people had Facebook pages open before the outage). There’s certainly a drop in dark social, but it's not nearly as stark, and dark social traffic just before the outage was only 11% higher than at its low point during the outage. Since 70% of Facebook traffic dropped off, that would imply that at most 16% (11% / 70%) of dark social traffic could’ve been directly attributable to Facebook.

Now, of course, we’d expect some other social sharing might be negatively impacted — if people aren’t discovering articles on Facebook, they might not be sharing them in other ways. So, that doesn’t mean that 16% of dark social traffic is from Facebook, but it does provide strong evidence that at least 84% of dark social traffic is something other than Facebook traffic in disguise.

Where people go in an outage

As I discussed in my last post, a huge percentage of mobile traffic comes from Facebook. Given that, we’d probably expect mobile traffic to be hardest hit during the outage. And, indeed, entrances to sites on mobile devices were down 8.5%, when comparing the minute before the outage to the lowest point while Facebook was down.

Interestingly, though, we see the opposite effect on desktops: a 3.5% overall increase in desktop traffic after the beginning of the outage. That increase was largely fueled by a 9% increase in homepage direct traffic on sites with loyal homepage followings. We saw no increases in traffic via other referrers, including Twitter and Google News, during the outage. While we certainly can't claim that the outage was the cause of that uptick in desktop traffic, the timing is certainly notable.

[Image: traffic-desktop-direct (desktop homepage direct traffic around the outage)]

In short, then: our brief world without Facebook looked a bit different, albeit in predictable ways. Significantly less news was consumed on phones, slightly more homepages were visited on desktops, and 30 minutes later, when Facebook came back online, traffic returned to normal.

The Homepage, Social, and the Rise of Mobile

July 28th, 2014 by Josh

In the much-circulated New York Times Innovation Report, perhaps the most discussed graph was this one, showing a roughly 40% decline in homepage audience over the past three years.

[Image: nytimes-innovation-homepage (the homepage audience graph from the NYT Innovation Report)]

With that graph, innumerable articles announcing the “death of the homepage” were written, in The Atlantic, Poynter, and on numerous blogs. Most hinged on the relationship between the rise of social traffic and the decrease in homepage traffic. One thing that isn’t mentioned in most of these articles, though, is that the rise in social traffic was contemporaneous with a rise in mobile traffic, and that mobile is as central to the story as social is. Here, I’d like to explore the three-way interaction between mobile traffic, social, and homepage visitation.

Social traffic and mobile devices

The importance of social sharing on mobile devices is much discussed. (Take for example, the recent ShareThis report, which reported that 63% of Twitter activity and 44% of Facebook activity happens on mobile.) People aren’t just using social media on mobile to share articles, of course, they’re also clicking to those articles. Below, we break down the share of traffic coming from Facebook and Twitter by device across a random sample of our sites. (Note: We specifically chose sites without separate mobile sites and without mobile apps, to ensure that we’re making fair comparisons across devices.)

[Image: traffic-device (share of referrals from Facebook and Twitter, by device)]

Facebook’s share of overall mobile referrals is nearly 2.7x larger than its share on desktop. Twitter’s share is 2.5x larger on mobile than on desktop. And, if anything, those numbers likely undercount the significance of social referrals, since many apps don’t forward referrer information and get thrown into the bucket of “dark social.” In some sense, then, it’s fair to say that—for most sites—mobile traffic more-or-less is social traffic.

Mobile and homepage traffic

Setting aside where visitors come from, mobile visitors are substantially less likely to interact with a site’s homepage. Below we plot, for the same collection of sites as above, the fraction of visitors that have visited any landing page (e.g. the homepage, a section front) over a month.

[Image: homepage-all (fraction of visitors who visit any landing page, by device)]

What we see is dramatic: Desktop visitors are over 4x more likely to visit landing pages than those on phones.

Is that because mobile visitors come from social sources, and social visitors are less likely to visit landing pages—a fact that’s often cited when discussing the state of homepage traffic? Or is it not an issue of referrer at all—are mobile visitors intrinsically less likely to visit landing pages? To move toward an answer, we can control for referrer and ask the same question. Below, we plot the fraction of visitors who come to the site from Facebook and then, during the same month (but not necessarily on the same visit), visit a landing page.

[Image: homepage-facebook (fraction of Facebook-referred visitors who later visit a landing page, by device)]

Comparing this graph to the previous one, three things are clear:

  1. As discussed above, mobile visitors are significantly less likely to ever visit landing pages than desktop and tablet visitors.
  2. Similarly, visitors who come from Facebook are significantly less likely to ever visit landing pages than those who come from other sources. On average, only 6% of visitors who come from Facebook ever visit a landing page, compared to nearly 14% of overall visitors.
  3. These two phenomena are to some degree independent—desktop-based Facebook visitors are half as likely to visit landing pages as other desktop-based visitors, while mobile Facebook visitors are one-third as likely to visit homepages as other mobile visitors.

It’s also worth a quick note that, in all of these respects, tablet traffic is much closer to desktop traffic than it is to mobile traffic.

Overall, this seems to be cause for substantial concern to publishers—increases in social and mobile traffic are the two most significant traffic trends of the past few years, and both are strongly associated with drops in homepage traffic. Since, as we’ve seen before, homepage visitors are typically a site’s most loyal audience, potential drops in homepage visitors should be concerning. In the short term, it’s safe to assume that a successful mobile strategy will hinge upon a steady stream of social links—that visitors won’t return unless we reach out to them directly. In the longer term, there’s a lot of work for all of us in determining how best to build an audience in a post-desktop (and potentially post-homepage) world.

Attention Web World Cup Wrap-Up: Sample Size and Variability

July 17th, 2014 by Dan

After a month of exciting matches, the Attention Web World Cup has come to a close. In a time-honored tradition (pun intended) Ghana defeated the US with a score of 30 to 25. Congratulations to everyone from Ghana who was consuming content on the web during World Cup matches; you all contributed to this amazing achievement! And to my fellow Americans: next time around, let’s spend more time reading, okay?

To wrap up the festivities, one of our designers made these awesome animations of the time course of each tournament game based on the data I pulled. These plots show the median Engaged Time for users from each country as each match progresses.

When you view these animations, you’ll likely notice that some of these countries have incredibly stable Engaged Times while others have Engaged Times that are incredibly erratic. The U.S., for instance, shows very little variance in median Engaged Time, while Cote d’Ivoire and Cameroon have median Engaged Times that jump all over the place.

This behavior is a consequence of sample size. At any particular time during a match, users from many of the African countries and other smaller countries made up a much smaller sample than, say, users from the US or Australia. In statistics and data analysis, we’re always concerned about sample size for exactly the reason illustrated in many of these graphs: the variability in a sampled statistic can mask its “true” value. We can try to capture this variability with a distribution, but if that distribution is wide, then we can’t be very confident in the value of whatever measure of central tendency we choose (mean, median, mode, etc.). And the variance of a sample estimate shrinks roughly in proportion to the inverse of the sample size, so only as the number of points we’ve sampled gets large do we have a hope that the confidence in our estimate will rise.
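As a quick illustration of that point (using made-up engagement times, not actual AWWC data), here is how the spread of a sample median shrinks as the sample grows:

```python
import random
import statistics

random.seed(0)

def spread_of_median(sample_size, trials=2000, mean_seconds=40):
    """Standard deviation of the sample median across repeated draws from one population."""
    medians = [
        statistics.median(random.expovariate(1 / mean_seconds) for _ in range(sample_size))
        for _ in range(trials)
    ]
    return statistics.stdev(medians)

for n in (10, 100, 1000):
    print(n, round(spread_of_median(n), 1))
# The spread falls roughly like 1 / sqrt(n): small samples of users give erratic medians.
```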

I’m actually quite surprised the U.S. made it so far in my scoring scheme here. I knew going into the #AWWC that some countries were sorely underrepresented in our sample, so I expected a fair chance that these countries would show a falsely high median Engaged Time: if enough of a small sample of users just so happened to be long-engagement users, this would skew their results. In the Group Round this was okay, because I performed a statistical test that tried to account for this variability. There, I asked a very common statistical question: Assuming these two teams actually have the same median Engaged Time, what is the probability that I’d observe a difference in medians at least as extreme as the one I’ve observed? If that probability was low enough, I declared Team A and Team B to have different medians and took the higher one as the winner. But in the bracket round we needed clear winners (no draws were allowed), so we left it up to sampling variance. For the small-sample-size teams, this was a double-edged sword. They only needed a few users spending an inordinate amount of time engaged with content to edge above the higher-sample-size teams; conversely, if the users they did have spent very short times, that would skew them toward losing. We can see, though, that this seemed to work out well for these countries—they made a great showing all the way through the AWWC.
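The Group Round test described above can be sketched as a permutation test on the difference in medians; the data below is invented, and the real analysis may have differed in its details:

```python
import random
import statistics

def permutation_p_value(team_a, team_b, n_permutations=10_000, seed=1):
    """Assuming the teams share the same median Engaged Time, estimate the probability
    of seeing a difference in medians at least as extreme as the observed one."""
    rng = random.Random(seed)
    observed = abs(statistics.median(team_a) - statistics.median(team_b))
    pooled = list(team_a) + list(team_b)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        a, b = pooled[:len(team_a)], pooled[len(team_a):]
        if abs(statistics.median(a) - statistics.median(b)) >= observed:
            extreme += 1
    return extreme / n_permutations

# Hypothetical engaged-time samples in seconds: one tiny sample, one large one.
data_rng = random.Random(3)
small_team = [data_rng.expovariate(1 / 35) for _ in range(8)]
large_team = [data_rng.expovariate(1 / 28) for _ in range(2000)]
print(permutation_p_value(small_team, large_team))  # small p-values suggest a real difference; otherwise, a draw
```

In the bracket round, by contrast, we simply compared the two observed medians and took the larger one, noise and all.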

Thinking about variability is my job, so I might be biased here (yes, a statistics pun), but I hope you enjoyed this fun exploration of our data. I hope it got you thinking about international variability in engagement, and variability of metrics in general. Tweet me @dpvalente or email me at dan@chartbeat if you want to continue the discussion.