Archive for the ‘Data Science’ Category

How Engaged Time Affects Reading Comprehension

December 22nd, 2014 by Chris

We recently got an interesting question from a client about the connection between engaged time and understanding in news articles. A priori, one may think that there should be a strong correlation–someone quickly skimming through an article should not be expected to retain as much as someone carefully reading–but there are some reasons this might not be the case.

Journalists are taught to get to the point quickly in their news articles, using an inverted pyramid style. One of the presumed benefits is to allow readers to exit the story at any point they like and still retain the important information from the story. One of the most common phrases a young journalist hears is often “don’t bury the lede” for this very reason. Perhaps, then, it might be that a reader gleans most of his/her understanding from a story from the first seconds of reading, and the marginal value of spending more time reading the rest of the story is relatively small. We decided to put the assertion to the test.

First, let’s you and I conduct an informal experiment right now. I’m going to present you with a few dense bullet points of the findings of the study, and I’d like you to decide for yourself whether you feel like you have anything to gain from reading further.

  • We conducted a survey of over 1000 people to investigate the association between how long a reader is engaged reading a news article and what s/he takes away from it.
  • We confirmed that there is a strong association between fact recall and engaged time.
  • Readers engaged for more than a minute were almost twice as likely to recall specific facts about the article as readers who spent less than 15 seconds. This was true even when the fact was found in the first lines of the article.
  • Further, we found evidence that readers who spend more time engaged are more likely to agree with the author’s conclusions.

Now that you’ve gotten the facts, you can feel free to leave this blog post and spend your valuable time elsewhere, but if you’ll indulge me, I think you’ll find that your understanding may be heightened if you let me go into a little more detail about the experiment and the results.

Experimental Design

In order to test the relationship between reading comprehension and engagement, we carefully considered how best to induce paid participants to act like internet users reading news articles. We’ve previously established that it’s not common for a reader to read all the way through an article. In a typical article, the most likely (modal) behavior for a reader is to leave after only about 15 seconds. The simplest survey design would be to instruct readers to read through an article however they like and then ask questions about it afterwards. The problem with structuring an experiment like this is that a paid participant just doesn’t act like a typical internet user. We tried this with a quick pilot study. The average reading time for the article we selected was an order of magnitude higher than what you might expect for a typical reader.

It struck me that one important consideration that this design was obviously lacking was an element of choice. When a reader visits your site, s/he is making a choice to spend valuable time reading your content instead of reading a different article, looking at funny cat pictures, or spending time off of the internet entirely. In the naive design, we’d effectively purchased time from people to take our survey, so they felt compelled to “do a good job” and carefully read through the article, even though that’s not what we wanted them to do at all.

Once we settled on a design that included elements of choice, we got more sensible results. We put together a simple website that showed one of eight different news articles with a button to flip from one to the next. The participants were instructed to read as much or as little of each article as they liked. After five minutes, the site would redirect users to a survey page with five multiple-choice questions about one particular article, an opinion piece about Iranian air strikes:

  1. A detail question asking about a fact from the first paragraph of the article.
  2. A detail question about a fact from near the end of the article.
  3. An “attention check” designed to weed out respondents who were not reading the questions.
  4. A conceptual question asking for a summary of the author’s thesis.
  5. An opinion question relating to the author’s message.

We asked 1,000 paid participants on Amazon’s Mechanical Turk to take our survey. Of these, we eliminated about 10% for various reasons (e.g., failing the attention check question, multiple submissions from the same device, not clicking through to the article we asked questions about). When we look at an article, our engagement metrics look reasonable:


The graph above shows a peak engagement of about 20 seconds, gradually tapering off as time increases. This reassures us that we are sampling a population that models internet viewership reasonably well.

We can also intuitively understand the pattern of responses.


The majority of readers could correctly identify a detail from the beginning of the article and summarize the author’s thesis, but fewer were able to answer a question about a detail near the end of the article. This can be explained by the fact that relatively few people read through the article. In truth, the 37% that answered the question correctly is likely as high as it is only because you could expect 25% correct from random chance.

Fact Recall by Engaged Time

Looking deeper, we find a strong association between fact recall and engaged time. Readers who spent more than a minute were almost twice as likely to recall specific facts about the article as readers who spent less than 15 seconds (approximately the top and bottom quartiles of engaged time among the responses).

Let’s look specifically at conceptual understanding. Roughly 40% of participants engaged for less than 15 seconds correctly assessed the message of the article, compared to more than 80% of those engaged for more than a minute. We’ve plotted recall against engagement below, with its associated 95% confidence interval. The slope of the logistic regression tells us that for every 15 seconds of increased engagement we can expect to see about a 30% increase in the odds of answering the question correctly.


The complete results are summarized in the following table. In particular, for each of the questions, we see a positive association between recall and engaged time, even when the relevant information was found at the very beginning of the article.

Respondents Answering Correctly by Engaged Time

Question | Overall | < 15 seconds | > 1 minute | Increase in odds per 15-second increase in engaged time
Conceptual Understanding | 62% | 42% | 81% | 32%
Detail from Beginning of Article | 63% | 39% | 81% | 31%
Detail from End of Article | 37% | 27% | 44% | 8%
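
To make the odds language concrete, here is a small sketch of how a per-15-second increase in odds translates into probabilities. It starts from the 42% figure for the under-15-second group and applies the 32% odds increase from the table. Treating that group as a single starting point is a simplifying assumption, and the function names are illustrative only.

```python
def prob_to_odds(p):
    """Convert a probability to odds, p / (1 - p)."""
    return p / (1.0 - p)


def odds_to_prob(odds):
    """Convert odds back to a probability."""
    return odds / (1.0 + odds)


def prob_after_steps(base_prob, odds_increase, steps):
    """Probability after `steps` increments, each multiplying the odds by (1 + odds_increase)."""
    odds = prob_to_odds(base_prob) * (1.0 + odds_increase) ** steps
    return odds_to_prob(odds)


# Figures from the table: 42% correct in the under-15-second group and a
# 32% increase in odds per additional 15 seconds (conceptual question).
# Using 42% as the starting point is an assumption made for illustration.
for steps in range(5):
    p = prob_after_steps(0.42, 0.32, steps)
    print(f"+{15 * steps:>2d} s engaged: {p:.0%} predicted correct")
```

Note that a 32% increase in odds is not a 32% increase in probability: the probability climbs more slowly as it approaches 100%.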

A matter of opinion

The opinion question was adapted from a June 2014 CBS News/New York Times Poll: “Do you favor or oppose the United States working with Iran in order to try and resolve the situation in Iraq?” In the original poll, 53% were in favor, 39% opposed, and 8% were unsure. The first observation about the results is that respondents were much less likely to express an opinion, but given the survey population this is not surprising. The second observation is that readers of this article, which supported this position, were relatively more likely to agree with the author’s position. In particular, the portion of responses agreeing with the author varied significantly between people who engaged with the article for more than a minute and those who engaged for less than 15 seconds:


I caution that the experiment does not show causation. It’s probably the case, for instance, that readers who more strongly agree with the author are more likely to be engaged with the article. However, it is at least plausible that reading the article helped shape the opinion of the readers (though we would have to do more tests to find out how much this might be the case in general).

It’s worth explicitly noting that we only effectively ran this experiment for one particular article. Whereas we think the questions we’ve raised and attempted to answer are important, and that the results we’ve shown are useful directional indications, it’s clear that the magnitude of the results will depend on the questions asked and the article itself.

Win Hearts and Minds

We’ve seen before that engaged time affects things like brand recall for advertisements, so the result that engaged time affects reading comprehension is not altogether surprising. What is interesting is the extent to which this is borne out in the results.  This just adds to the growing body of evidence that capturing the attention of your readers gives you the opportunity to win both their hearts and their minds.

For writers, I suppose this conclusion is both a blessing and a curse. Yes, you’re still going to have to spend time polishing the second halves of your articles; but if you focus on keeping your readers engaged, they will ultimately take away more from what you’re saying. And isn’t that the point of effective journalism?

The State of Dark Social in 2014

December 4th, 2014 by Chris

Here at Chartbeat, we have a long history of trying to shed light on the sources of your traffic. Since 2012, we’ve helped illuminate the phenomenon known as dark social—where traffic is likely to come from social sources, yet lacks explicit referrer attribution. Two years later, Internet traffic looks a lot different than it previously did. Mobile and application traffic have grown significantly. More sites are moving to HTTPS. Usage patterns are evolving. We wanted to take the opportunity to look into the current state of dark social and dive deeply not only into potential causes, but also potential disambiguations of this nebulous block of traffic. As a result of these investigations, we found a way to attribute a sizable chunk of dark social (up to half!) to application traffic.

So what exactly is dark social? Here’s a brief recap. Back in the Wild West of web analytics, we tagged any traffic coming in without a referrer field as direct traffic. Many people have attempted to exhaustively list what might cause an empty referrer field, but the typical explanation of a visitor typing in the URL directly was unsatisfying for article content. The alternate explanation that these visitors came from IMs, emails, or apps seemed much more likely, and so we categorized them as social instead. This social traffic came to be known as “dark social” and has made regular appearances at the top of referrer lists ever since.

Dark Social Volume

These days, dark social accounts for about a third of external traffic to sites across our network. The exact amount varies quite a bit depending on the particular site in question, but most sites have a chunk ranging from significant to extremely significant. The following graph shows a rough distribution of the percent of external traffic classified as dark social for a given domain for a sample of Chartbeat’s data, with the mean given in red:

We can break this data out further. The number is markedly higher on mobile, with upwards of 50% of mobile external traffic lacking a referrer on some sites. This is already a critical problem — how are we to analyze our top traffic drivers if we can’t attribute half of our traffic? — and since mobile’s share of traffic is increasing, it’s only going to get worse.

Potential Sources of Dark Social

In order to get a handle on the drivers of the problem, we did an empirical analysis of potential sources of dark social by setting up a site, posting links to it on various traffic sources, and clicking those links from a wide variety of traffic sources. The goal was to determine which traffic sources can be reliably assumed to not be dark social (because they always successfully set the referrer) and which do contribute to dark social (because they always lack a referrer or sometimes lack a referrer).

We were specifically interested in looking at some of the most popular social mobile apps. The following table shows whether some of the combinations of sites and modes of interaction successfully passed a referrer in our testing (with the caveat that we only tested the current versions of the applications and were not exhaustive using all different browsers and operating systems):

Referrer Passed?
Mobile Browser
Mobile App

We can see that major traffic sources are generally good about allowing their data to be tracked. However, there were some interesting exceptions.

  1. Facebook’s desktop site sometimes doesn’t set a referrer if the onclick listener is avoided (for instance, if you open a link in a new tab/window).
  2. Desktop and mobile web traffic from Reddit sets a referrer, but the top apps for reading Reddit all do not set a referrer.
  3. It became clear in further analysis (see below) that the Facebook app only sometimes sets the referrer.

Beyond these notes, things generally worked as expected: email, IM, and most mobile apps were dark social; social networks and major sources of external traffic (even some using HTTPS, like Facebook and Google) were not.

Disambiguating Traffic with Time Series

The above findings raised more questions than they settled. If less well measured sources like email and IM drive a significant portion of traffic, do they at least correlate well with more explicitly measurable sources of traffic? For applications like Facebook and Reddit that do not always send referrer data, is there a way to identify their contributions within patterns of dark social traffic? We found that in many cases, the answers to these questions were a resounding yes.

For the next phase of our analysis, we wanted to take a look at the time series data for specific articles to try to identify patterns in the traffic. If a popular story were to break, you’d expect to see different responses in different traffic sources. For a site like Reddit, you might expect traffic to be tightly peaked and highly correlated with the story’s ranking on the home page. For a site like Facebook, the interest might fade out more gradually as it filters through different people’s feeds. You might expect instant messaging to yield a tighter, shorter-tailed traffic distribution than a medium like email. The following plot shows an interesting example of a story that illustrates some of these features. There was a distinct spike in Reddit-driven traffic that lasted all of four hours followed by a more prolonged pickup in Facebook traffic.

The most interesting observation here is how well correlated dark social traffic is with the identifiable sources. In this example, you could be convinced that the dark social is really just misattributed traffic from Facebook and Reddit. Some evidence for this:

  • The residual traffic is almost non-existent, and in particular, the amount of internal and search traffic is negligible.
  • If a secondary social sharing mechanism like email or IM were driving a significant amount of traffic, we’d expect to see some delay in the dark social time series from the sharp spike in traffic.

We can further break these numbers down by examining the difference between mobile and desktop traffic. In the following graph, we zoom into the Reddit spike in traffic in the above article.

We can see a stark divergence in traffic patterns by device, which confirms some of our earlier findings. We have at least strong anecdotal evidence that a large portion of Reddit mobile traffic comes from apps and is being categorized as dark social.

We can examine the patterns of traffic for the Facebook-driven portion of the time series as well:

Here, the picture is not quite as cut-and-dried as before. Dark social comprises only a small percentage of overall desktop traffic, but commands a fairly significant chunk of mobile traffic. This pattern is typical across other articles: when we observe Facebook traffic, we can almost always find a corresponding amount of dark social traffic. The actual amount of dark social traffic relative to Facebook traffic can vary significantly by article and by site, but will generally be much higher on mobile devices. As Facebook is such a large driver of mobile traffic in general, this can help explain some of the difference we see between desktop dark social share and mobile dark social share.

Of course, it’s difficult to disambiguate where dark social is coming from at scale — it’s a mix of traffic from many referrers. But, for a large majority of stories, if we look at the top 10 referrers and correlate the time series of traffic that they send with dark social’s time series, we get some referrer that’s a very high match, which strongly suggests that that particular story is getting its dark social from that particular referrer.

This suggests that, while we can’t just flip a switch and disambiguate all traffic, a careful analysis of a particular story is likely to turn up the source of the majority of its dark social. Of course, this won’t always work: there are still person-to-person shares (IM, email, etc.), shares on apps with no corresponding website, and so forth that account for a chunk of dark social. Still, if we look at correlations between dark social traffic and other traffic sources (a rudimentary and blunt tool, to be sure), we see that for fewer than 25% of stories the dark social time series is less than 80% correlated with another source, and for many stories the correlation is much higher.
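
The time series matching described above can be sketched in a few lines. This is a minimal illustration rather than Chartbeat’s actual analysis: the hourly counts are made up, the function names are hypothetical, and plain Pearson correlation stands in for whatever measure was really used.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no shape information
    return cov / math.sqrt(var_x * var_y)


def best_matching_referrer(dark_social, referrer_series):
    """Return (referrer, correlation) for the referrer that best tracks dark social."""
    scores = {name: pearson(dark_social, series)
              for name, series in referrer_series.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical hourly visit counts for one story (made-up numbers).
dark_social = [5, 40, 90, 60, 20, 10, 8, 6]
referrers = {
    "reddit.com": [10, 80, 200, 120, 40, 15, 10, 8],
    "facebook.com": [2, 5, 10, 30, 60, 80, 70, 50],
    "google.com": [12, 11, 13, 12, 11, 12, 13, 12],
}

name, r = best_matching_referrer(dark_social, referrers)
print(name, round(r, 2))
```

A high correlation only says the two curves have the same shape. As noted above, it is a blunt tool, so a match is a strong hint rather than proof of attribution.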

Disambiguating App Traffic Using User Agents

In this analysis, we discovered that many major apps set a string in the user agent that can be used to identify the app, even in the case that the app doesn’t set a referrer. Facebook, Buzzfeed, Twitter, QQ, Baidu, and others all do this. By looking at this user agent string and using it to identify the referrer, we’re able to disambiguate a non-trivial portion of dark social traffic and correctly attribute it to specific mobile apps. We recently implemented this change, and if you happened to be looking closely at your dashboard around 6pm last night, you might have seen your traffic jump up by 40% and your dark social fall by 5-10% when we flipped the switch. While this is only a small piece of the overall dark social share, it is a clear step in the right direction. As more apps take similar measures, this approach has the potential to help reverse the growth of the dark social problem.
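
As a rough sketch of the idea, attribution can fall back to the user agent when the referrer is empty. The token list below is an assumption made for illustration (substrings of the kind some in-app browsers have been observed to send), not Chartbeat’s actual rule set, and the sample user agent string is made up.

```python
# Illustrative token table. These substrings are assumptions for the sake of
# the example, not Chartbeat's real attribution rules; apps change their
# user agent formats over time.
APP_TOKENS = [
    ("FBAN/", "Facebook app"),
    ("FBAV/", "Facebook app"),
    ("Twitter for iPhone", "Twitter app"),
]


def attribute_visit(referrer, user_agent):
    """Classify a visit using the referrer first, then the user agent as a fallback."""
    if referrer:
        return referrer  # an explicit referrer always wins
    for token, app_name in APP_TOKENS:
        if token in user_agent:
            return app_name
    return "dark social"  # no referrer and no recognizable app token


# A made-up user agent in the style of an in-app browser.
SAMPLE_UA = (
    "Mozilla/5.0 (iPhone; CPU iPhone OS 8_1 like Mac OS X) "
    "AppleWebKit/600.1.4 (KHTML, like Gecko) Mobile/12B411 "
    "[FBAN/FBIOS;FBAV/20.0.0.0]"
)

print(attribute_visit("", SAMPLE_UA))
print(attribute_visit("", "Mozilla/5.0 (Windows NT 6.1)"))
```

Checking the referrer first keeps existing attribution unchanged; the user agent only reclassifies visits that would otherwise have been counted as dark social.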

Going Forward

As we get more data from the User Agent change, it will be interesting to see how much of the relationship between dark social and some of the major applications remains. Will the relationship between dark social and Facebook mobile traffic disappear? Well, probably not, because there will still be people who see a link on Facebook and then share it through text or email or other means.

Still, the general approach of looking into your articles’ traffic patterns is quite fruitful — you’re likely to be able to identify the source of dark social for specific stories if you choose to dive in (feel free to reach out if you’d like advice on how to do it using our historical APIs).

As always, we’ll be keeping a keen eye on dark social. Please feel free to reach out to me with any questions, specifically about your traffic or generally about dark social, at

Economics of Ad Refreshing

November 5th, 2014 by Justin

Editor’s Note: This article originally appeared in the fall 2014 issue of the Chartbeat Quarterly, our once-a-season data science magazine.

When a television program goes to commercial break, we see a series of 30-second spots, rather than one continuous advert. That three minutes of commercial time generates more collective value to advertisers when it’s split up than if it were given to a single advertiser. So what happens if we apply the same principle to ads on the Internet?

Our research suggests that the longer an ad is in view, the greater the likelihood that a person will recall the brand behind the advertisement. However, according to multiple studies, after a short period of time, the effect of time on brand recall is greatly diminished (Figure 1).


This means that ads with higher active exposure time have higher value to advertisers, but only to a point. So why not exploit this fact by “refreshing” an ad after a fixed amount of time?

Ad refreshing is not a new idea, but it is unpopular because ads refreshed traditionally (after a certain amount of wall clock time has passed) are unlikely to be seen. A series of non-viewable ads has no value to advertisers. On the other hand, if we refresh ads once they’ve been in view for a set amount of time, we can ensure that an ad was seen for a fair amount of time before changing it over to a new one and that the new ad will be viewed.

This is an exciting idea because refreshing ads generates a large number of new viewable impressions. Traditionally, if a user is reading a page for two minutes with an ad in view, this person will only be exposed to one ad in a given position. If we refresh each ad after it’s viewed for 30 seconds, however, each single impression becomes four, generating three additional impressions, each of which is viewed. Table 1 shows the impact of different ad refresh times on viewable impressions and average exposure times across the Chartbeat network.
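
The arithmetic in the two-minute example can be written as a tiny helper. This is a sketch under a simplifying assumption: every refreshed ad counts as a viewable impression as soon as it appears, ignoring any minimum in-view threshold. The function name is hypothetical.

```python
import math


def viewable_impressions(exposure_seconds, refresh_seconds=None):
    """Number of ads shown in one position during `exposure_seconds` of in-view time."""
    if exposure_seconds <= 0:
        return 0
    if refresh_seconds is None:
        return 1  # no refreshing: one ad for the whole visit
    # Each ad occupies the slot for at most `refresh_seconds` of viewed time.
    return math.ceil(exposure_seconds / refresh_seconds)


# Two minutes in view with a 30-second refresh, as in the example above.
print(viewable_impressions(120))      # without refreshing
print(viewable_impressions(120, 30))  # with a 30-second refresh
```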


From Table 1, we see that the number of viewable impressions on a typical site can be increased by as much as 93% if a 10-second ad refresh is used. This has the effect of almost doubling the inventory of viewable impressions on a site. On the other hand, we also see that this reduces the time that people spend with individual ads on average, because we are limiting the amount of time people can spend with an individual ad. This means that each refreshed impression has slightly less value to an advertiser than before. Because of this, we can probably expect that advertisers would require a discount to compensate for the loss of time.

So, is ad refreshing worth it? Does the value of an increased inventory of viewable impressions offset the loss in value to each refreshed impression? To answer this question, we will investigate the economic ramifications of ad refreshing.

Our goal is to compare the value of the ad inventory on a typical site with and without ad refreshing.

According to research at Yahoo, the closer an ad is to the start of a session, the more likely a user is to recall the brand represented in that ad. This means that when refreshing ads, the ads shown first have more value than the ads shown later. In fact, the researchers suggest that showing more than two ads in a single session is unlikely to be effective. Therefore, for our comparison we will only analyze single ad refreshes within an ad unit and we will make the following assumptions:

  1. Value of ad exposures to an advertiser can be quantified by recognition and recall.
  2. This value to advertisers correlates directly to revenue for the publisher.
  3. The values of first and second ad impressions are represented by Figure 2, which relates exposure time to recall and recognition.


We use these assumptions to calculate a baseline value of the ad inventory for a typical site without ad refreshing and compare this to the value of the ad inventory using different ad refresh times.
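
The shape of this comparison can be sketched as follows. Everything numeric here is hypothetical: the saturating `recall` curve merely stands in for Figure 2, the `second_ad_weight` discount is an invented way to make a second ad worth less than the first, and the exposure times are made up. None of it reproduces the figures reported in this article.

```python
import math


def recall(seconds, tau=10.0):
    """Hypothetical saturating recall curve (a stand-in for Figure 2, not its data)."""
    return 1.0 - math.exp(-seconds / tau)


def value_without_refresh(exposure):
    """Value of a single ad that stays in the slot for the whole exposure."""
    return recall(exposure)


def value_with_single_refresh(exposure, refresh_after, second_ad_weight=0.8):
    """The first ad gets at most `refresh_after` seconds; a second ad gets the remainder."""
    first = recall(min(exposure, refresh_after))
    second = second_ad_weight * recall(max(exposure - refresh_after, 0.0))
    return first + second


exposures = [5, 12, 30, 45, 90]  # made-up in-view times, in seconds
baseline = sum(value_without_refresh(t) for t in exposures)
refreshed = sum(value_with_single_refresh(t, refresh_after=20) for t in exposures)
print(f"relative value with refresh: {refreshed / baseline:.2f}")
```

Because the assumed curve flattens out, cutting a long exposure short costs the first ad little, while the second ad starts again on the steep part of the curve. That is the diminishing-returns argument in miniature.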

As we can see in Table 2, ad refreshing does result in an increase in ad revenue. This means that the increase in viewable impression inventory outweighs the loss in value to refreshed impressions thanks to the diminishing returns in recall shown in Figure 2.


Researchers agree that refreshing ads this way should increase a site’s revenue, and I think this calculation bears that out. For example, with a 10-second ad refresh the typical site gains 93% extra inventory of viewable impressions, and a 12% increase in revenue. Even with our relatively conservative calculation that only allowed for a single ad refresh, we see a healthy increase in revenue. For this reason, it seems likely that ad refreshing will be a significant source of new revenue for online publishers.



Clicks, Coins, and Eight Cups of Tea

August 28th, 2014 by Dan

The story goes like this:

Sometime around 1935, the eminent statistician Ronald A. Fisher met a lady. The lady in question had a seemingly outrageous claim: She could tell, simply by taste, whether milk or tea was added to a cup first. But Fisher was skeptical, and, being the eminent statistician that he was, he developed a method by which to test her claim. He would present her with a series of eight cups; in half of those, milk would be placed in the cup first, and in the other half, tea would be placed in the cup first. Fisher would then hand the cups to the lady in random order and ask her which liquid was added first. If she performed better than chance, he’d believe her claim.
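
As a rough check on what “better than chance” means here, we can count the possible answers. The sketch assumes, as in the classical version of the experiment, that the lady knows exactly four of the eight cups had milk added first, so guessing amounts to picking four cups out of eight.

```python
from math import comb

cups = 8
milk_first = 4

# Number of distinct ways to choose which 4 of the 8 cups had milk added first.
arrangements = comb(cups, milk_first)

# Under pure guessing every arrangement is equally likely, so a perfect
# score has probability 1 / arrangements.
p_all_correct = 1 / arrangements
print(arrangements, round(p_all_correct, 4))
```

With 70 equally likely arrangements, a perfect score by luck alone happens about 1.4% of the time.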

Fisher’s description of this account was one of the first applications of hypothesis testing, perhaps the most widely used—and arguably one of the most important—statistical concepts of the modern era. That it was focused on a simple two-outcome choice is not surprising. Many processes we run into every day can be modeled in this way, as a coin flip. Will the subway show up on time? Will someone click the link to this article? Will the Detroit Tigers or the Kansas City Royals win this game? 1

These kinds of problems, those with only two outcomes, are known in statistics as Bernoulli processes. The main parameter governing these phenomena is the probability that a trial succeeds. In Fisher’s example, this is the probability that the lady correctly identifies whether milk or tea is added first. For web traffic, this is the probability of clicking a link. In many of these types of two-outcome problems, you want to know how likely it is that you’ll observe some number of successes in a given number of trials. For example, you may be interested in the probability that 50 people will click on a link if 100 people see it. If you make an assumption that each event (i.e., each click) is independent of the others, the probability that you see some number of successes can be described by the binomial distribution. With a firm understanding of Bernoulli processes and the binomial distribution, you are equipped for modeling a whole host of binary-outcome problems.
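
For the 50-clicks-out-of-100 example, the binomial probability can be computed directly. The click probabilities tried below are assumed values chosen only for illustration.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Probability that exactly 50 of 100 viewers click, for two assumed click probabilities.
for p in (0.3, 0.5):
    print(p, binomial_pmf(50, 100, p))
```

Even when the click probability really is 0.5, seeing exactly 50 clicks is fairly unlikely (roughly 8%), because the probability is spread across many nearby counts.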

Is this a fair coin?

A binomial distribution, however, isn’t incredibly useful if we don’t know the probability of success for a single trial. Honestly, this is what we’re typically interested in finding out, and it is really what Fisher tested: He assumed the probability of a lady guessing whether milk or tea was added first was pure chance (50/50), and developed a test to see whether the results of the experiment were consistent with that assumption. But, in general, how do we determine what this probability is?

There are two ways we can estimate the probability from a set of trials. We could simply count the number of successes we’ve had and divide by the total number of trials. For instance, if we flipped a coin 10 times and it came up heads 3 of those times, we might guess that the coin is an unfair coin, landing on heads only 30% of the time. This is all well and good, but we only flipped the coin 10 times. How certain are we that the probability is actually 0.3? Perhaps it truly is a fair coin and our sample size was just too small.

Alternatively, we could assume that our probability of interest itself has some distribution. That is, perhaps we think that the probability is about 0.3, but we concede that it could be 0.1 or 0.5 or even 0.99999. Treating our parameter as a distribution is the heart of a technique known as Bayesian inference, which is based upon Bayes’ rule:

Prob(quantity-of-interest | observation) ∝ Prob(observation | quantity-of-interest) × Prob(quantity-of-interest)

Don’t be intimidated by this equation; it is actually fairly intuitive. The left-hand side represents the answer to the question: Given the data we observed, how certain are we that our quantity-of-interest takes on a given value? This is called the posterior distribution. The right-hand side contains information about what we believe about the process we’re interested in. Prob(quantity-of-interest) is known as the prior distribution. This describes our initial beliefs about the quantity we’re trying to find out about; in this case, our probability of success in the Bernoulli trial. Prob(observation | quantity-of-interest) is called the likelihood. The likelihood describes what we believe the distribution of the data to be if we assume our quantity is a specific value. In our click-through/coin-flipping example, this is simply the binomial distribution. If we know the fairness of the coin p, then the probability that we get M successes out of N flips is given by a binomial distribution with parameters N and p. Then, a simple multiplication of our prior and our likelihood gives us our posterior. 2

The above equation may not seem very impressive, but the real power of the Bayesian method comes in when we iteratively apply the equation to update our beliefs. That is, we can use a previously calculated posterior as a prior in a next round of calculation to update our posterior. If we do this enough times, we hope to decrease our uncertainty enough so that we can confidently determine what our “true” probability is. The neat thing is that if we choose our prior intelligently, we can get the math to work out so that updates are fairly easy.
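
One common way to “choose the prior intelligently” for a Bernoulli process is a Beta prior, which is conjugate to the binomial likelihood: the posterior is again a Beta distribution, and updating amounts to adding counts. The post does not say which prior it uses, so the Beta choice and the flat Beta(1, 1) starting point below are assumptions.

```python
def update_beta(alpha, beta, successes, failures):
    """Posterior Beta parameters after observing new successes and failures."""
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Start from a flat Beta(1, 1) prior (an assumed starting point).
prior = (1, 1)

# Updating in two rounds, using the first posterior as the next prior...
step1 = update_beta(*prior, successes=3, failures=7)
step2 = update_beta(*step1, successes=27, failures=63)

# ...gives the same posterior as a single update with all the data at once.
all_at_once = update_beta(*prior, successes=30, failures=70)

print(step2, all_at_once, round(beta_mean(*step2), 3))
```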

That’s the math; here is a concrete example. Suppose we’re interested in the probability that a visitor to a website will click on some link. If 10 visitors come to the page, and three of those people click on the link, we might guess that the click-through probability for that link is 3/10 = 0.3, but we wouldn’t be very certain; we only flipped the coin a small number of times. The far left panel on the figure below shows a prior we might build based on that uncertainty. It is peaked near 0.3, but is quite wide. 3


Now suppose that we’ve waited long enough for many, many visitors. The two subsequent panels show how the distribution evolves as we gather more data. When we’ve seen 1000 visitors, we are pretty darn certain that the click-through probability is somewhere very close to 0.3. Now imagine what happens when we’ve seen 10,000 visitors!
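
To put rough numbers on how the distribution narrows, here is a sketch that assumes a flat Beta(1, 1) prior and an observed click rate of exactly 30% at each sample size. Both are assumptions for illustration; the figure itself may have been built differently.

```python
import math


def beta_sd(alpha, beta):
    """Standard deviation of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    return math.sqrt(alpha * beta / (total**2 * (total + 1)))


for visitors in (10, 1000, 10000):
    clicks = visitors * 3 // 10  # exactly 30% of visitors click (assumed)
    alpha, beta = 1 + clicks, 1 + (visitors - clicks)
    mean = alpha / (alpha + beta)
    print(visitors, round(mean, 3), round(beta_sd(alpha, beta), 4))
```

The spread shrinks roughly with the square root of the number of visitors, so 100 times more data makes the posterior about 10 times narrower.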

The subtleties of assumption

Pretty amazing, right? If we gather data long enough, we can be incredibly certain about our click-through probability. In many cases, this is true. But let’s back up a bit.

In considering Bernoulli processes there is a fundamental underlying assumption that can often be overlooked. The assumption is this: The probability of success, p, is constant from trial to trial. For most phenomena, this is a reasonable assumption. But what if it is not? If the probability varies from trial to trial and this isn’t accounted for in our Bayesian updating, then we can end up becoming very certain about an incorrect probability. Consider the following example, where our probability varies smoothly between 0.3 and 0.6 over the course of 1,000 trials.


What happens when we do Bayesian updating with the same assumptions as above?


Not only does the peak of our posterior jump around wildly, depending on how many trials we do, but we start becoming incredibly certain that the probability is near the dead center of our varying probability function. I like to call this spurious certainty. We have an inaccurate model and too much data! We have become too certain in our beliefs.
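
Spurious certainty is easy to reproduce in a small simulation. The sketch below assumes the probability follows one smooth sine cycle between 0.3 and 0.6 over 1,000 trials and that updating starts from a flat Beta(1, 1) prior; the post does not specify either, so treat both as illustrative choices.

```python
import math
import random


def true_probability(trial, total_trials):
    """Assumed drift: one smooth cycle between 0.3 and 0.6."""
    return 0.45 + 0.15 * math.sin(2 * math.pi * trial / total_trials)


def run_updates(total_trials=1000, seed=0):
    """Naive Beta updating that wrongly treats the click probability as constant."""
    rng = random.Random(seed)
    alpha, beta = 1, 1  # flat prior (an assumption)
    for trial in range(total_trials):
        if rng.random() < true_probability(trial, total_trials):
            alpha += 1
        else:
            beta += 1
    return alpha, beta


alpha, beta = run_updates()
total = alpha + beta
mean = alpha / total
sd = math.sqrt(alpha * beta / (total**2 * (total + 1)))
print(f"posterior mean {mean:.3f}, sd {sd:.3f}")
```

The posterior ends up only a couple of percentage points wide, even though the true probability ranged over thirty points during the run. The model is confident about a single number that was never the right answer for most of the trials.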

This may seem like a contrived case, but in actuality, it is not. In fact, we’ve seen data here at Chartbeat to suggest that the probability to click on a link is time dependent. Take the figure below, which shows the average click probability for all links on an anonymous site’s homepage on a particular day.


The probability drops by 70% from the beginning of the day to around 2 p.m., and then climbs back up. To accurately depict the click-through behavior of this site’s users, we have to take this variation into account to avoid spurious certainty.

Using the data as a guide

How exactly can we take into consideration this time variation? We could add time directly into our Bayesian updates, but to get good data we might have to wait a long time. After all, in the general case we don’t really know what this variation looks like. Does our probability vary by time of day? Day of week? Month? Year? All of these? In reality, we probably don’t have enough time to gather enough data for our Bayesian updating to be very informative.

An alternative way is to forget about doing any sort of modeling and simply use measurements. In this method, we forget about priors and posteriors and likelihoods and just make a histogram of the data we’ve measured. We could, in effect, build an empirical form of the distributions from the figures above. Then we can update our beliefs by averaging the histogram of old data with the histogram of new data; we can even use a weighted average so anomalies will get “smoothed out.” We may not get a nice thin distribution, but at least we capture some of this temporal variation and we avoid spurious certainty. In fact, we’ve built our Heads Up Display, which measures click-through probabilities, to do exactly this.
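
As a rough illustration of the idea, and not the actual Heads Up Display code, the sketch below bins measured click-through rates into a normalized histogram and blends old and new histograms with a weighted average. The bin count, the weight, and the sample values are all made up for the example.

```python
def normalized_histogram(values, n_bins=10):
    """Histogram of values in [0, 1], scaled so the bins sum to 1."""
    counts = [0] * n_bins
    for v in values:
        index = min(int(v * n_bins), n_bins - 1)  # v == 1.0 goes in the last bin
        counts[index] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * n_bins

def blend(old_hist, new_hist, new_weight=0.3):
    """Weighted average of two histograms; a small weight smooths anomalies."""
    return [(1 - new_weight) * o + new_weight * n
            for o, n in zip(old_hist, new_hist)]

# Hypothetical click-through rates measured in two batches.
old_rates = [0.32, 0.35, 0.41, 0.28]
new_rates = [0.12, 0.15, 0.18, 0.35]

belief = blend(normalized_histogram(old_rates), normalized_histogram(new_rates))
print([round(x, 3) for x in belief])
```

Because the new batch only gets 30% of the weight, an unusual stretch shifts the distribution without overwriting what was seen before.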

The Tao of Data

In my opinion, we—and by we I mean humanity—should be ever the skeptics. In data analysis, this is paramount. Like Fisher, we should question outrageous claims and find ways to test them. We should revisit old assumptions, test them, and revisit them again. The data will guide the way, but we should always beware of spurious certainty.

Or, you know, you could always just flip a coin.

If you’d like to talk about this in more detail, perhaps over a cup of tea, contact me at dan@chartbeat or find me on Twitter.

1. The Tigers. Always bet on the Tigers.

2. Ignoring, of course, the fraction’s denominator, but that is a bit out of the scope of this post… which is math speak for laziness.

3. A note about how to read probability density functions if you are not familiar with them: Probability density functions (PDFs) are truly probability densities; that is, the area under the curve between two values on the x-axis gives the probability that our quantity-of-interest will be between those two points. That’s why the y-axis is so funny. To get the probability, we essentially need to multiply the y-axis value by the distance between two values on the x-axis. If that doesn’t make any sense, just know that the best way to think about these distributions is to see where the curve is the most dense, that is, where the greatest area under the curve is. So, the places where the peaks exist are the most probable values. I’ll blog more about distributions in the near future.

Automatic Storytelling: Or, How to Build Your Very Own Data Scientist

August 19th, 2014 by Justin

As a data scientist, one of my primary jobs is to interpret large amounts of data and craft a story about what I discover. Not everyone is a data geek who wants to wade through large data sets, so it’s important to find a way to communicate insights from data that everyone can understand. Unfortunately, the process of analyzing data and compiling interesting results can be very time consuming. Even so, after telling these stories many times, some patterns emerge in the data analysis and communication of the findings. This led the data science team at Chartbeat to ask the question: Can we build an automated data scientist that can seek out interesting stories within our data?

Before we tackle “automated storytelling,” as we call it, let’s walk through the process I might go through when analyzing some data. Consider the small data set about pageviews, tweets, and average Engaged Time to one article on a completely hypothetical website.

All articles on the site: average Engaged Time 38 seconds
“article”: average Engaged Time 30 seconds

The first row contains information about all articles on the site; the second row contains information about one page on the site. So, after looking at this data, I might come up with an insight like the following:

Even though “article” had below average engagement for the site, readers shared this story 5 times more often than the typical story.

Let’s break down where this insight came from. We see that “article” had five tweets, but without context, this does not tell us much. A great way to give context to a number is to compare it to a benchmark. For example, how does this number compare to the typical article on this website or the Internet as a whole? Put into the context of a larger picture, we can judge if a number is good or not. In this case, we are given all we need to know about Twitter shares across the site, so let’s compare Twitter activity on “article” to the average article on the site. However, since the overall site has much more traffic than “article,” comparing the raw number of tweets for each would be unfair. When comparing numbers, it is important to compare apples to apples. The standard way to deal with this is to normalize your values. In this case, we consider the tweet rate for both, that is, the number of tweets per pageview:

Twitter share rate = number of tweets / pageviews

The table above then becomes:

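All articles on the site: Twitter share rate 1 / 50, average Engaged Time 38 seconds
“article”: Twitter share rate 1 / 10, average Engaged Time 30 seconds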
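
The normalization is easy to check in a few lines of Python. In this sketch the five tweets come from the example, while the 50 pageviews are inferred from the 1 / 10 share rate rather than read off the data; the function name is ours.

```python
from fractions import Fraction

def share_rate(tweets, pageviews):
    """Twitter share rate = number of tweets / pageviews."""
    return Fraction(tweets, pageviews)

# "article": five tweets; 50 pageviews is inferred from the 1 / 10 rate.
article_rate = share_rate(5, 50)
# Site-wide share rate, as given in the table.
site_rate = Fraction(1, 50)

# How many times more often "article" was shared than the typical article.
ratio = article_rate / site_rate
print(article_rate, site_rate, ratio)
```

Exact fractions keep the comparison free of floating-point noise, so the ratio comes out as a clean whole number here.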

Now we are prepared to ask the following question: Was this page shared a lot? Or, how did the share rate of “article” compare to that of the site as a whole? We answer:

“Article” was shared once per 10 pageviews, 5 times more than the typical article on the site.

This gives us an interesting one dimensional fact about the article. To get a more complete picture of how this article performed, however, it would be better to consider multiple data points. In this case, we also have access to information about how much time users spent engaging with content, so we can use that as well.

We ask an additional question: Was this story engaging? Or, how did the average engagement of “article” compare to the typical story on the site? We answer:

Readers of “article” spent an average of 30 seconds actively reading content, which is less than the 38 seconds of the typical story on the site.

As we ask and answer additional questions about the data for “article,” we start to get a more complete picture of the success of the story. In fact, if we combine this information, we start to build a story about our data, and in this case we will end up with something similar to what we stated above.

In summary, we performed a two-step process where we answered two questions:

  1. Was “article” shared a lot on Twitter?
  2. Was “article” engaging?

Since both of these questions have two possible answers, yes or no, we have four total possible scenarios. This can be represented as a decision tree like the following:


For “article” we answered YES to question 1 and NO to question 2. This corresponds to the following path in our decision tree:
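
In code, walking the tree amounts to answering the two questions in order. The sketch below assumes that “a lot” and “engaging” both mean “above the site-wide figure”; that threshold and the names used are our choices for illustration.

```python
from fractions import Fraction

def decision_path(page, site):
    """Answer the two questions in order and return the path through the tree."""
    shared_a_lot = page["share_rate"] > site["share_rate"]        # question 1
    engaging = page["engaged_seconds"] > site["engaged_seconds"]  # question 2
    return ("YES" if shared_a_lot else "NO", "YES" if engaging else "NO")

site = {"share_rate": Fraction(1, 50), "engaged_seconds": 38}
article = {"share_rate": Fraction(1, 10), "engaged_seconds": 30}

path = decision_path(article, site)
print(path)
```

Each of the four possible tuples corresponds to one leaf of the tree.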


Repeating this procedure with another story, we might end up in a different branch of the tree. For example, consider the new data set:

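All articles on the site: Twitter share rate 1 / 50, average Engaged Time 38 seconds
“article”: Twitter share rate 1 / 10, average Engaged Time 30 seconds
“article #2”: Twitter share rate 1 / 60, average Engaged Time 50 seconds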

When we ask the same series of questions for “article #2”, we would follow this path:


And we could formulate a sentence like the following:

While “article #2” was shared less than the typical story, the content really captured its audience’s attention, with readers spending 32% more time engaged than the typical article.

In fact, we can create a different sentence for each of the four scenarios, so that no matter the situation we find ourselves in, we will have a sentence which describes the data in an interesting way. So, for a general article on the site, we could do the following:

  1. Ask a series of questions which create a path through a decision tree.
  2. Grab a template sentence that corresponds to the final destination of the decision tree.
    • For example, let X be an article with a Twitter share rate Y times larger than the site.
    • Even though X had below average engagement, readers shared this story Y times more often than the typical story.

  3. Fill in the blanks:
    • Even though “article” had below average engagement for the site, readers shared this story 5 times more often than the typical story.
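
Steps 2 and 3 can be sketched with ordinary string formatting. The template text is the sentence from the example; the function and variable names are hypothetical.

```python
from fractions import Fraction

# Template for the branch "shared a lot: YES, engaging: NO", taken from the text.
TEMPLATE = (
    "Even though {x} had below average engagement, readers shared this "
    "story {y} times more often than the typical story."
)

def fill_in(title, article_rate, site_rate):
    """Fill X with the quoted title and Y with the share-rate ratio."""
    y = article_rate / site_rate
    return TEMPLATE.format(x=f"“{title}”", y=f"{float(y):g}")

sentence = fill_in("article", Fraction(1, 10), Fraction(1, 50))
print(sentence)
```

The `g` format drops a trailing “.0”, so a whole-number ratio reads naturally in the sentence.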

Now we have found a way to automatically generate a basic sentence about tweets and engagement on an article, but what more can we do to make this feel like real insights from a human data scientist?

Above we created one template sentence per branch of the decision tree.  A simple trick we can play to give this process a more human touch is to increase the variety in the sentences that are created. For example, we could take the sentence:

Even though “article” had below average engagement for the site, readers shared this story 5 times more often than the typical story.

…and restate it as:

1 of every 10 people who read “article” tweeted about it, 5 times as often as the average story on the site. On the other hand, these readers only spent 30 seconds engaging with this content on average.

Rather than writing one template sentence per branch of the decision tree, we can create a collection of templates. This serves to create an illusion of a real data scientist telling you these facts, and will prevent the results from getting stale. We can also use additional data to include related data points. For example, in the case when the story is active on Twitter, we could enhance our original insight in the following way:

Even though “article” had below average engagement, readers shared this story 5 times more often than the typical story. In fact, the tweet from “user” generated 20 pageviews and 100 total seconds of engagement for this story.  

Every time a question is asked in the decision tree, if there is additional data available, we can automatically add in extra information to flesh out the narrative.

While the example presented was very simple, there are endless possible data points to consider in general. We can extend the method described here to more complex scenarios where there are many more levels to the decision tree with multiple paths at each node.

This is the general framework:

  1. Create a decision tree where each node corresponds to a question about your data set.
    • Note: It is better to make sure that all of your questions are related in some way. The more related the questions are, the more coherent the final story will be.
  2. Create a list of sentences that represents the story at the end of each branch of the decision tree.
  3. For a given data set, ask the series of questions to create a path through the tree.
  4. After answering each question, create supporting facts that will help flesh out the story.
  5. Randomly choose one of the template sentences corresponding to this branch of the tree and fill in the blanks.
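
Put together, the five steps might look something like the sketch below. This is a toy version under our own assumptions, not the code behind Chartbeat’s tools: the questions compare against site-wide figures, only two of the four branches have templates (adapted from the example sentences above), and every name is hypothetical.

```python
import random

# Step 1: each node of this tiny two-level tree is a yes/no question.
QUESTIONS = [
    lambda d: d["share_rate"] > d["site_share_rate"],            # shared a lot?
    lambda d: d["engaged_seconds"] > d["site_engaged_seconds"],  # engaging?
]

# Step 2: a list of template sentences per branch (only two branches shown).
TEMPLATES = {
    (True, False): [
        "Even though {title} had below average engagement, readers shared "
        "this story {ratio:g} times more often than the typical story.",
        "1 of every {per} people who read {title} tweeted about it. On the "
        "other hand, these readers only spent {seconds} seconds engaging "
        "with this content on average.",
    ],
    (False, True): [
        "While {title} was shared less than the typical story, readers spent "
        "{pct:.0f}% more time engaged than with the typical article.",
    ],
}

SUPPORTING_FACT = (
    " In fact, the tweet from {user} generated {pageviews} pageviews and "
    "{seconds} total seconds of engagement for this story."
)

def tell_story(d, rng=random):
    """Walk the tree, pick a random template for the branch, fill the blanks."""
    path = tuple(bool(question(d)) for question in QUESTIONS)  # step 3
    templates = TEMPLATES.get(path)
    if templates is None:
        return None  # no templates written for this branch yet
    fields = {
        "title": f"“{d['title']}”",
        "ratio": d["share_rate"] / d["site_share_rate"],
        "per": round(1 / d["share_rate"]),
        "seconds": d["engaged_seconds"],
        "pct": 100 * (d["engaged_seconds"] / d["site_engaged_seconds"] - 1),
    }
    story = rng.choice(templates).format(**fields)             # step 5
    if path[0] and "top_tweet" in d:                           # step 4
        story += SUPPORTING_FACT.format(**d["top_tweet"])
    return story

article = {
    "title": "article",
    "share_rate": 0.1, "site_share_rate": 0.02,
    "engaged_seconds": 30, "site_engaged_seconds": 38,
    "top_tweet": {"user": "“user”", "pageviews": 20, "seconds": 100},
}
print(tell_story(article))
```

Passing a seeded `random.Random` instance as `rng` makes the template choice repeatable, which is handy when testing.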

In fact, this methodology powers the automated research tool within Chartbeat’s Advertising Platform, which automatically creates insights and data narratives about the ad campaigns on your site. For example, this is a narrative about how an ad campaign performed on a site’s top content.


This was constructed by following the red path through this decision tree in a way that is very similar to the example we walked through above:

So, what do you think? We’d love to hear about your applications of this methodology.