Hannah Keiler was our Fall 2013 Data Science intern here at Chartbeat, working with Chief Data Scientist Josh Schwartz. Hannah is a senior at Columbia University, where she studies Statistics with a concentration in Computer Science. This blog post details one of several projects she tackled during her internship at Chartbeat.
At Chartbeat, we sometimes want to compare metrics across similar sites. There are several different ways to group sites. For example, you can begin by thinking about grouping sites by size – comparing metrics like number of readers or articles published each day. We were also interested in grouping together sites that write about similar content. Grouping sites by content manually for thousands of domains is incredibly tedious, so we wanted to devise a metric that would allow us to group similar sites automatically.
One way to define similar content is to say that two sites are similar if they write about similar subjects at around the same time. If sites write about the same subjects, they are probably using the same keywords, like “Obama” or “Syria.” We also reasoned that the words that best summarize the content of an article are likely to be the words appearing in its headline. Keeping these ideas in mind, we developed our metric.
We start by comparing sites two at a time. Let’s call the sites A and B. We look at the words used in the headlines in A and B day by day.
For each day, we record the words used in the headlines of both A and B and compute a weighted sum of their counts. That is, for each shared word, we divide the number of times it occurs in both A and B that day by a number indicating how often that word occurs in headlines in general. Weighting the counts this way gives more weight to rarer words, which helps us pick out pairs of sites that write about the same niche topics. We sum these values within each day, and then sum the daily totals across all of the days. Let’s call this final sum “Value 1.”
We also record all of the words used in headlines by either A or B for each day. Then for each day we compute a weighted sum of these word counts and then add up all the weighted sums from each of the days into one value. Let’s call this “Value 2.”
Then we divide Value 1 by Value 2. We now have a ratio of sorts of the number of words A and B share versus the number of words they use in total.
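To make the steps above concrete, here is a minimal Python sketch of the calculation. It is an illustration rather than the code we ran, and it makes several assumptions: the function names (word_counts, headline_similarity) are hypothetical, headlines are split into lowercase words with a simple regular expression, the weight for a word is one divided by its general headline frequency, and "occurs in both A and B" is read as the smaller of the two sites' daily counts (with the larger count used for "either A or B").

```python
import re
from collections import Counter


def word_counts(headlines):
    """Count lowercase words across one site's headlines for a single day."""
    counts = Counter()
    for headline in headlines:
        counts.update(re.findall(r"[a-z0-9']+", headline.lower()))
    return counts


def headline_similarity(headlines_a, headlines_b, general_freq):
    """Return Value 1 / Value 2 for two sites.

    headlines_a, headlines_b: dict mapping a day label to a list of headlines.
    general_freq: dict mapping a word to how often it occurs in headlines
        in general (assumed precomputed; unknown words default to 1).
    """
    value_1 = 0.0  # weighted count of words used by both sites
    value_2 = 0.0  # weighted count of words used by either site
    for day in set(headlines_a) | set(headlines_b):
        counts_a = word_counts(headlines_a.get(day, []))
        counts_b = word_counts(headlines_b.get(day, []))
        for word in set(counts_a) | set(counts_b):
            # Assumption: rarer words get more weight via inverse frequency.
            weight = 1.0 / general_freq.get(word, 1.0)
            value_1 += weight * min(counts_a[word], counts_b[word])
            value_2 += weight * max(counts_a[word], counts_b[word])
    return value_1 / value_2 if value_2 else 0.0


# Toy data, invented purely for illustration.
site_a = {"day 1": ["Obama speaks on Syria"]}
site_b = {"day 1": ["Syria talks resume"]}
freq = {"obama": 2.0, "speaks": 1.0, "on": 4.0,
        "syria": 2.0, "talks": 1.0, "resume": 1.0}
score = headline_similarity(site_a, site_b, freq)
```

With this reading, two sites with identical headlines score 1, and two sites with no words in common score 0. Other readings of "both" and "either" (for example, summing the two sites' counts) would give different numbers but the same general behavior.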
How does this look?
We first computed the similarity metric for sites whose content we thought was geared towards sports, music, or celebrity/entertainment news. To visualize the similarity metric, we plotted the sites as nodes in a connected graph.
FYI: These graphs are anonymized because we don't share individual client data.
The distance between the sites represents their similarity: sites that are closer together have a higher similarity score. On this graph, the sports sites are dark blue, the celebrity sites are red, and the music sites are teal. As you can see, sites with similar content group together! The fact that the celebrity sites are in the middle implies that they share some content with both the music and the sports sites, which makes sense. The outlier posts fewer articles daily than the other celebrity news sites, so there was less overlap in term usage and, accordingly, its similarity scores were lower.
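For readers who want to build a similar graph, one possible way to turn pairwise scores into graph edges is sketched below. This is an assumption-laden illustration, not a description of how our plots were made: the function name similarity_edges, the min_similarity cutoff, and the choice of 1 minus the score as a distance are all hypothetical, and the site names and scores are invented.

```python
from itertools import combinations


def similarity_edges(sites, similarity, min_similarity=0.0):
    """Build (site_a, site_b, score, distance) tuples for every pair of sites.

    similarity: any function taking two site names and returning a score
        between 0 and 1. Pairs scoring at or below min_similarity are dropped.
    """
    edges = []
    for a, b in combinations(sorted(sites), 2):
        score = similarity(a, b)
        if score > min_similarity:
            # Assumption: higher similarity should mean a shorter edge.
            edges.append((a, b, score, 1.0 - score))
    return edges


# Invented scores for three anonymized sites.
scores = {
    ("music_1", "sports_1"): 0.1,
    ("music_1", "sports_2"): 0.0,
    ("sports_1", "sports_2"): 0.6,
}
edges = similarity_edges(["sports_1", "sports_2", "music_1"],
                         lambda a, b: scores[(a, b)])
```

The resulting edge list could then be handed to a graph layout tool of your choice, which would place strongly connected sites near each other.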
We also tried out our metric on British and Australian news sites, which produced the graph below.
Here, the UK sites in red group together and the Australian sites in teal group together. The outlier writes more niche news stories than general Australian news, so it had less overlap with the other Australian and British news sites.
These initial results show that sites that post articles with the same topics in the headlines at around the same time tend to be similar types of sites. Moving ahead, this could be a great way to group sites into different categories based on their content.