Automatic Storytelling: Or, How to Build Your Very Own Data Scientist
As a data scientist, one of my primary jobs is to interpret large amounts of data and craft a story about what I discover. Not everyone is a data geek who wants to wade through large data sets, so it’s important to find a way to communicate insights from data that everyone can understand. Unfortunately, the process of analyzing data and compiling interesting results can be very time-consuming. Even so, after telling these stories many times, some patterns emerge in the data analysis and in the communication of the findings. This led the data science team at Chartbeat to ask the question: Can we build an automated data scientist that can seek out interesting stories within our data?
Before we tackle “automated storytelling,” as we call it, let’s walk through the process I might go through when analyzing some data. Consider the small data set about pageviews, tweets, and average Engaged Time to one article on a completely hypothetical website.
WEBSITE.COM | |||
ARTICLE |
The first row contains information about all articles on website.com; the second row contains information about one page on the site. So, after looking at this data, I might come up with an insight like the following:
Even though “article” had below average engagement for “website.com,” readers shared this story 5 times more often than the typical story.
Let’s break down where this insight came from. We see that “article” had five tweets, but without context, this does not tell us much. A great way to give context to a number is to compare it to a benchmark. For example, how does this number compare to the typical article on this website or the Internet as a whole? Put into the context of a larger picture, we can judge whether a number is good or not. In this case, we are given all we need to know about Twitter shares across the site, so let’s compare Twitter activity on “article” to the average article on “website.com.” However, since the overall site has much more traffic than “article,” comparing the raw number of tweets for each would be unfair. When comparing numbers, it is important to compare apples to apples. The standard way to deal with this is to normalize your values. In this case, we consider the tweet rate for both, that is, the number of tweets per pageview:
Twitter share rate = number of tweets / pageviews
The table above then becomes:
WEBSITE.COM | ||
ARTICLE |
Now we are prepared to ask the following question: Was this page shared a lot? Or, how did the share rate of “article” compare to “website.com”? We answer:
“Article” was shared once per 10 pageviews, 5 times more than the typical article on “website.com.”
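The normalization step can be sketched in a few lines of Python. The article figures follow from the text (five tweets at one tweet per 10 pageviews implies 50 pageviews); the site-wide totals are hypothetical numbers chosen only so that the site rate is one fifth of the article's rate, and the function name `share_rate` is my own.

```python
def share_rate(tweets: int, pageviews: int) -> float:
    """Tweets per pageview; returns 0.0 when there is no traffic."""
    return tweets / pageviews if pageviews else 0.0


# "article": 5 tweets at one tweet per 10 pageviews implies 50 pageviews.
article_rate = share_rate(tweets=5, pageviews=50)

# Hypothetical site-wide totals (not taken from the original table).
site_rate = share_rate(tweets=200, pageviews=10_000)

# Comparing rates instead of raw counts keeps the comparison apples to apples.
relative_rate = article_rate / site_rate

print(f"article: {article_rate:.2f} tweets per pageview")
print(f"site:    {site_rate:.2f} tweets per pageview")
print(f"article was shared {relative_rate:.0f} times as often as the typical story")
```

Comparing the raw counts (5 tweets versus 200) would have suggested the opposite conclusion, which is why the rate is the number worth reporting.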
This gives us an interesting one-dimensional fact about the article. To get a more complete picture of how this article performed, however, it would be better to consider multiple data points. In this case, we also have access to information about how much time users spent engaging with content, so we can use that as well.
We ask an additional question: Was this story engaging? Or, how did the average engagement of “article” compare to the typical story on “website.com”? We answer:
Readers of “article” spent an average of 30 seconds actively reading content, which is less than the typical story on “website.com.”
As we ask and answer additional questions about the data for “article,” we start to get a more complete picture of the success of the story. In fact, if we combine this information, we start to build a story about our data, and in this case we will end up with something similar to what we stated above.
In summary, we performed a two-step process where we answered two questions:
- Was “article” shared a lot on Twitter?
- Was “article” engaging?
Since both of these questions have two possible answers, yes or no, we have four total possible scenarios. This can be represented as a decision tree like the following:
For “article” we answered YES to question 1 and NO to question 2. This corresponds to the following path in our decision tree:
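In code, the two yes/no questions and the path they select might look like the sketch below. The `PageStats` type, the function names, and the site-wide numbers (including the 45-second site average) are assumptions made for illustration; only the article's 5 tweets, 50 pageviews, and 30 seconds come from the example above.

```python
from typing import NamedTuple, Tuple


class PageStats(NamedTuple):
    pageviews: int
    tweets: int
    avg_engaged_seconds: float


def shared_a_lot(page: PageStats, site: PageStats) -> bool:
    """Question 1: is the page's tweet rate above the site's tweet rate?"""
    return page.tweets / page.pageviews > site.tweets / site.pageviews


def was_engaging(page: PageStats, site: PageStats) -> bool:
    """Question 2: is the page's average Engaged Time above the site's?"""
    return page.avg_engaged_seconds > site.avg_engaged_seconds


def branch(page: PageStats, site: PageStats) -> Tuple[bool, bool]:
    """Answer both questions; the pair of answers identifies one of four leaves."""
    return (shared_a_lot(page, site), was_engaging(page, site))


# Hypothetical site totals; the article values follow the running example.
site = PageStats(pageviews=10_000, tweets=200, avg_engaged_seconds=45.0)
article = PageStats(pageviews=50, tweets=5, avg_engaged_seconds=30.0)

path = branch(article, site)
print(path)  # YES to question 1, NO to question 2
```

Representing the path as a tuple of answers is just one convenient choice: it can be used directly as a dictionary key when looking up a sentence for that branch.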
Repeating this procedure with another story, we might end up in a different branch of the tree. For example, consider the new data set:
WEBSITE.COM | ||
ARTICLE #1 | ||
ARTICLE #2 |
When we ask the same series of questions for “article #2”, we would follow this path:
And we could formulate a sentence like the following:
While “article #2” was shared less than the typical story, the content really captured its audience’s attention, with readers spending 32% more time engaged than the typical article.
In fact, we can create a different sentence for each of the four scenarios, so that no matter the situation we find ourselves in, we will have a sentence which describes the data in an interesting way. So, for a general article on “website.com” we could do the following:
- Ask a series of questions which create a path through a decision tree.
- Grab a template sentence that corresponds to the final destination of the decision tree.
  - For example, let X be an article with a Twitter share rate Y times larger than the site.
  - Even though X had below average engagement, readers shared this story Y times more often than the typical story.
- Fill in the blanks:
  - Even though “article” had below average engagement for “website.com,” readers shared this story 5 times more often than the typical story.
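The "grab a template and fill in the blanks" step maps naturally onto Python's string formatting. This is a minimal sketch: the two templates for the mixed branches are the sentences from this post, while the wording for the other two branches, the field names, and the `fill_template` helper are invented here for illustration.

```python
# Keys are the (shared a lot?, engaging?) answers to the two questions.
TEMPLATES = {
    # Sentences taken from the examples in this post.
    (True, False): (
        "Even though “{title}” had below average engagement for “{site},” "
        "readers shared this story {share_ratio:.0f} times more often than "
        "the typical story."
    ),
    (False, True): (
        "While “{title}” was shared less than the typical story, the content "
        "really captured its audience’s attention, with readers spending "
        "{engagement_gain:.0%} more time engaged than the typical article."
    ),
    # Placeholder wording, invented for this sketch.
    (True, True): (
        "Readers of “{title}” shared it {share_ratio:.0f} times more often "
        "than the typical story and also spent more time reading it."
    ),
    (False, False): (
        "“{title}” was shared less often than the typical story on “{site}” "
        "and had below average engagement."
    ),
}


def fill_template(branch, **facts):
    """Look up the sentence for a branch and fill in its blanks."""
    return TEMPLATES[branch].format(**facts)


print(fill_template((True, False), title="article", site="website.com", share_ratio=5))
print(fill_template((False, True), title="article #2", engagement_gain=0.32))
```

Because `str.format` ignores unused keyword arguments, the same dictionary of facts can be passed to every template, and each one picks out only the blanks it needs.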
Now we have found a way to automatically generate a basic sentence about tweets and engagement on an article, but what more can we do to make this feel like real insights from a human data scientist?
Above we created one template sentence per branch of the decision tree. A simple trick we can play to give this process a more human touch is to increase the variety in the sentences that are created. For example, we could take the sentence:
Even though “article” had below average engagement for “website.com,” readers shared this story 5 times more often than the typical story.
…and restate it as:
1 of every 10 people who read “article” tweeted about it, 500% more often than the average story on “website.com.” On the other hand, these readers only spent 30 seconds engaging with this content on average.
Rather than writing one template sentence per branch of the decision tree, we can create a collection of templates. This serves to create an illusion of a real data scientist telling you these facts, and will prevent the results from getting stale. We can also use additional data to include related data points. For example, in the case when the story is active on Twitter, we could enhance our original insight in the following way:
Even though “article” had below average engagement, readers shared this story 5 times more often than the typical story. In fact, the tweet from “user” generated 20 pageviews and 100 total seconds of engagement for this story.
Every time a question is asked in the decision tree, if there is additional data available, we can automatically add in extra information to flesh out the narrative.
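Both ideas, picking among several phrasings and appending a supporting fact when the data is available, can be sketched together. The template text comes from the examples above; the field names, the `narrate` function, and the shape of the `top_tweet` dictionary are assumptions made for this sketch.

```python
import random

# Two phrasings for the "shared a lot, but not engaging" branch.
SHARED_NOT_ENGAGING = [
    "Even though “{title}” had below average engagement for “{site},” "
    "readers shared this story {ratio:.0f} times more often than the "
    "typical story.",
    "1 of every {per:.0f} people who read “{title}” tweeted about it, "
    "{pct:.0f}% more often than the average story on “{site}.” On the other "
    "hand, these readers only spent {seconds:.0f} seconds engaging with this "
    "content on average.",
]

# Supporting fact, used only when data about a specific tweet is available.
TOP_TWEET_FACT = (
    "In fact, the tweet from “{user}” generated {pageviews} pageviews and "
    "{seconds} total seconds of engagement for this story."
)


def narrate(facts, top_tweet=None, rng=random):
    """Pick one phrasing at random and append the supporting fact if present."""
    story = rng.choice(SHARED_NOT_ENGAGING).format(**facts)
    if top_tweet is not None:
        story += " " + TOP_TWEET_FACT.format(**top_tweet)
    return story


facts = {"title": "article", "site": "website.com",
         "ratio": 5, "per": 10, "pct": 500, "seconds": 30}
top_tweet = {"user": "user", "pageviews": 20, "seconds": 100}

# A seeded generator makes the choice repeatable, which helps when testing.
print(narrate(facts, top_tweet, rng=random.Random(0)))
```

Passing the random generator in as a parameter is a design convenience: production code can use the default module-level generator, while tests can pin the seed.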
While the example presented was very simple, there are endless possible data points to consider in general. We can extend the method described here to more complex scenarios where there are many more levels to the decision tree with multiple paths at each node.
This is the general framework:
- Create a decision tree where each node corresponds to a question about your data set.
- Note: It is better to make sure that all of your questions are related in some way. The more related the questions are, the more coherent the final story will be.
- Create a list of sentences that represents the story at the end of each branch of the decision tree.
- For a given data set, ask the series of questions to create a path through the tree.
- After answering each question, create supporting facts that will help flesh out the story.
- Randomly choose one of the template sentences corresponding to this branch of the tree and fill in the blanks.
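One way to generalize these steps to deeper trees with more than two answers per node is to make the tree itself a data structure. The sketch below is an assumption about how that could be organized, not a description of Chartbeat's implementation: the `Question` and `Leaf` classes, the field names, the placeholder sentences, and the 0.67 engagement ratio are all hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Union


@dataclass
class Leaf:
    templates: List[str]  # candidate sentences for the end of this branch


@dataclass
class Question:
    ask: Callable[[dict], str]       # maps the data to an answer label
    branches: Dict[str, "Node"]      # one child node per possible answer
    fact: Optional[Callable[[dict], Optional[str]]] = None  # supporting fact


Node = Union[Question, Leaf]


def tell_story(root: Node, data: dict, rng: random.Random) -> str:
    """Walk the tree, collect supporting facts, then fill a random template."""
    node, facts = root, []
    while isinstance(node, Question):
        extra = node.fact(data) if node.fact else None
        if extra:
            facts.append(extra)
        node = node.branches[node.ask(data)]
    headline = rng.choice(node.templates).format(**data)
    return " ".join([headline] + facts)


def yes_no(flag: bool) -> str:
    return "yes" if flag else "no"


def engagement_question(if_yes: Leaf, if_no: Leaf) -> Question:
    return Question(ask=lambda d: yes_no(d["engagement_ratio"] > 1),
                    branches={"yes": if_yes, "no": if_no})


def top_tweet_fact(d: dict) -> Optional[str]:
    # Only add the fact when the story is active on Twitter and data exists.
    if d["share_ratio"] <= 1 or "tweet_user" not in d:
        return None
    return ("In fact, the tweet from “{tweet_user}” generated {tweet_pageviews} "
            "pageviews and {tweet_seconds} total seconds of engagement for "
            "this story.").format(**d)


tree = Question(
    ask=lambda d: yes_no(d["share_ratio"] > 1),
    fact=top_tweet_fact,
    branches={
        "yes": engagement_question(
            Leaf(["“{title}” was both widely shared and engaging."]),  # placeholder
            Leaf(["Even though “{title}” had below average engagement for "
                  "“{site},” readers shared this story {share_ratio:.0f} times "
                  "more often than the typical story."])),
        "no": engagement_question(
            Leaf(["“{title}” was shared less than the typical story, but "
                  "readers spent more time engaged than usual."]),  # placeholder
            Leaf(["“{title}” was neither widely shared nor engaging."])),  # placeholder
    },
)

data = {"title": "article", "site": "website.com", "share_ratio": 5.0,
        "engagement_ratio": 0.67,  # hypothetical value below the site average
        "tweet_user": "user", "tweet_pageviews": 20, "tweet_seconds": 100}
print(tell_story(tree, data, random.Random(0)))
```

Because answers are plain string labels rather than booleans, a node can have as many outgoing paths as its question has answers, and new questions can be added without changing `tell_story`.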
In fact, this methodology powers the automated research tool within Chartbeat’s Advertising Platform, which automatically creates insights and data narratives about the ad campaigns on your site. For example, this is a narrative about how an ad campaign performed on a site’s top content.
This was constructed by following the red path through this decision tree in a way that is very similar to the example we walked through above:
So, what do you think? We’d love to hear about your applications of this methodology.