How We Built Media Monitor: A Data Science Intern Story

What We Built

If you had access to tons and tons of data, what would you build? How would you use it? What would you analyze? As a Data Science Intern this summer, I availed myself of Chartbeat’s unparalleled access to content data and built something really, really exciting — we call it Media Monitor.

Media Monitor explores the variation in political news coverage across the United States. It’s a tool that identifies and displays news topics receiving disproportionate coverage in each of the 50 states, relative to the rest of the country. Or — select a presidential candidate, and it will show you the topics most frequently associated with that candidate in each state.

[Image: top terms associated with Donald Trump in New York]

How We Built It

Take One: Fun With Memory Errors

The data we have on news content – the information that powers Media Monitor – is stored in Elasticsearch, a document store optimized for in-depth search. Elasticsearch lets you explore statistical properties of your documents, most notably through a “significant terms” query. This query lets you define a category of documents, called a “foreground set” – for example, “all the articles published in New York” or “articles containing the word ‘immigration’” – and then tells you which terms are disproportionately common among documents in that category relative to all the other documents (the “background set”).
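To make that concrete, a significant terms query sent through the Python Elasticsearch client looks roughly like the sketch below. The index name (“articles”) and field name (“content”) are illustrative stand-ins rather than our actual schema, and the foreground set here is “articles containing the word ‘immigration’.”

```python
# A minimal sketch of a significant terms query via the Python Elasticsearch
# client. The index and field names are illustrative, not our real schema.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

response = es.search(
    index="articles",
    body={
        # Foreground set: articles containing the word "immigration".
        "query": {"match": {"content": "immigration"}},
        "size": 0,  # skip the document hits; we only want the aggregation
        "aggs": {
            "overrepresented_terms": {
                # Terms disproportionately common in the foreground set
                # relative to the rest of the index (the background set).
                "significant_terms": {"field": "content"}
            }
        },
    },
)

for bucket in response["aggregations"]["overrepresented_terms"]["buckets"]:
    print(bucket["key"], bucket["score"])
```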

This is exactly what we wanted Media Monitor to do: identify those political news topics that received more media attention in a particular state than other states AND those topics associated with one candidate more than other candidates.

For example, all the major presidential candidates have made statements about the economy, but whose message is earning media attention? Or more macroscopically, how do states differ in their coverage of different candidates? To see how Elasticsearch’s significant terms could help us answer these questions, I constructed a significant terms query against the complete content of all the articles in our Elasticsearch cluster, using “articles containing the word ‘immigration’” as the foreground set. I sent the query to our smaller test cluster…

And waited.

And waited some more.

And finally checked the health of the test cluster, and realized that the whole thing had crashed.

After digging through log files for a while, I figured out that the significant terms query had run out of memory; it crashed the entire test Elasticsearch cluster.

Our data set was just too big – not because it contained too many documents, but because it had too many unique words.

To facilitate searching, Elasticsearch stores what it calls an “inverted index,” a dictionary that maps each word to a list of all the documents that contain that word. That way, if I search for “immigration,” all it has to do is look up “immigration” in the inverted index and return to me the list of documents it already has stored right there. However, when Elasticsearch receives a significant terms query, it loads the entire inverted index into memory to do the math it uses to determine which terms are significant. If the inverted index is too big, the cluster overruns its memory and all hell breaks loose. No matter how simple or complex the parameters of our significant terms query, there was no way we could successfully run it on the complete content – with all the unique words – of all our articles.
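A toy version of the data structure makes the problem easier to see. This is a plain Python dict standing in for what Elasticsearch (via Lucene) actually stores, so it’s an oversimplification, but it shows why vocabulary size, not document count, is what hurts here.

```python
# A toy inverted index: each term maps to the set of documents containing it.
from collections import defaultdict

documents = {
    1: "immigration reform stalls in congress",
    2: "governor comments on new immigration policy",
    3: "local team wins the championship",
}

inverted_index = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.split():
        inverted_index[term].add(doc_id)

# A search for "immigration" is a single lookup...
print(sorted(inverted_index["immigration"]))   # [1, 2]

# ...but a significant terms query needs statistics for every entry,
# so memory use grows with the number of unique terms.
print(len(inverted_index), "unique terms")
```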

Take Two: Defining “Relevancy”

We currently have about 15 million documents stored in our Elasticsearch cluster, and that doesn’t cause any problems. Elasticsearch is designed to handle large numbers of documents easily.

Among those 15 million documents, we have upwards of two million unique terms. Our benchmarks showed, however, that to run a significant terms query successfully, we’d need ten thousand or fewer unique terms in our inverted index.

Fortunately, of those two million terms, the vast majority are not very useful. Many are stopwords — words like “and” or “the” that appear extremely frequently and carry almost no information. Some are words in languages other than English, which we decided to discard because most discussion of the American political process occurs in English. Many words in that index are highly specific to news topics other than politics (e.g., sports, music, or food), while others are political, but don’t tell us much on their own (like the word “politics” itself).

Since we already process the content of each article before adding it to Elasticsearch (to get word count, language, and so on), we could easily extract only the words from each article belonging to some smaller, more informative set, and store them alongside the full content. This would let us run a significant terms query just on this collection of relevant words, and we could pick the list of relevant words to keep in advance to make sure it contained fewer than ten thousand terms. I just needed to figure out how to decide which words were relevant.
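As a sketch of that pipeline step (with a hypothetical wordlist and field names, not our real schema), the change amounts to adding one small field per document:

```python
# A sketch of the pipeline step described above: alongside the full content,
# store only the article's words that appear on a pre-built list of relevant
# terms. The wordlist, field names, and document shape here are hypothetical.
RELEVANT_TERMS = {"immigration", "economy", "benghazi", "medicare"}

def extract_relevant_terms(content: str) -> list[str]:
    """Keep only the tokens that appear on the relevant-terms list."""
    return [token for token in content.lower().split() if token in RELEVANT_TERMS]

def build_es_document(article: dict) -> dict:
    """Build the document we index into Elasticsearch for one article."""
    content = article["content"]
    return {
        "content": content,                                  # full text, for ordinary search
        "relevant_terms": extract_relevant_terms(content),   # small vocabulary, for significant_terms
        "state": article.get("state"),
    }
```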

As a first attempt, I examined the list of articles tagged as “politics” in our dataset and selected words from these articles in two ways. First, I used a “named entity extractor” to pull out names of people, places, and organizations. Useful topics like “immigration” or “economy” aren’t named entities, though, so to capture those, I tried a simple thresholding system: throw away terms that appear too many times in the corpus of politics articles (mostly stopwords) or too few times (mostly misspellings) and keep the words in between those two thresholds, which are the most likely to be informative.
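In outline, the two passes looked something like the sketch below, with spaCy standing in for a generic named entity extractor and placeholder threshold values rather than the ones we actually used:

```python
# A sketch of the two selection passes: named entities plus mid-frequency
# words. spaCy stands in for a generic named entity extractor, and the
# thresholds are placeholder values.
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

def candidate_terms(politics_articles, min_count=20, max_count=50_000):
    entities = set()
    counts = Counter()
    for text in politics_articles:
        doc = nlp(text)
        # Pass 1: names of people, places, and organizations.
        entities.update(
            ent.text.lower()
            for ent in doc.ents
            if ent.label_ in {"PERSON", "GPE", "ORG"}
        )
        # Tally every word for the frequency thresholds.
        counts.update(token.text.lower() for token in doc if token.is_alpha)

    # Pass 2: keep mid-frequency words -- common enough to be meaningful,
    # rare enough not to be stopwords or misspellings.
    mid_frequency = {term for term, count in counts.items()
                     if min_count <= count <= max_count}
    return entities | mid_frequency
```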

Combining these two lists gave me roughly five thousand terms – comfortably under the ten-thousand-term ceiling. I modified our data pipeline to keep words from the combined list and store them separately from the full article content in Elasticsearch. Within a few days, enough new articles had been processed this way to see the results. I ran another significant terms query, using “articles published in New York” as my foreground set, but this time over the filtered words rather than the complete article content. I sent the query to the test cluster, and this time the results came back almost immediately.
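That query looked much like the earlier sketch, except that the aggregation now ran over the small filtered field and the foreground set was a filter on the publication state (index and field names again illustrative):

```python
# The same kind of significant terms query, but aggregated over the small
# "relevant_terms" field instead of the full content, with New York articles
# as the foreground set. Index and field names are illustrative.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

response = es.search(
    index="articles",
    body={
        "query": {"term": {"state": "NY"}},   # foreground: articles published in New York
        "size": 0,
        "aggs": {
            "state_terms": {
                "significant_terms": {"field": "relevant_terms"}
            }
        },
    },
)
```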

But then I looked at the actual results. The top five terms for articles from New York were “new york,” “manhattan,” “brooklyn,” “albany,” and “de blasio.” It was clear that, mathematically speaking, the significant terms query was doing exactly as advertised – we would certainly expect there to be more articles about Manhattan in New York than in other states – but this kind of obvious association wasn’t really what we were looking for.
Our wordlist contained far too many obviously location-specific terms (mostly place names and the names of local politicians) that would drown out any interesting regional differences.

Take Three: If You Want Something Done Right…

Clearly we needed to remove the unhelpful local terms from our wordlist if we wanted to see anything other than a list of places in each state, sometimes mixed with the names of its governor, senators, representatives, or prominent mayors. The problem is that even though a named entity extractor can identify which words refer to places and which refer to people, it doesn’t know which places are in the United States, and it doesn’t know which people are local politicians. We couldn’t just discard all places (Iran, for example, is certainly politically relevant) or all people (our tool wouldn’t be very useful if it couldn’t give any information about the current presidential candidates). No — this would have to be done by hand.

I spent several hours combing through the wordlist, removing the names of US states and cities, current or recent state politicians (excluding those currently running for president), and other political but non-distinguishing terms (like “treaty” or “govern”) that wouldn’t mean much if you saw them listed among a state’s most significant topics. I also added some useful terms that had been omitted for various reasons (largely because they became time-sensitive topics only after I generated the original wordlist), like “planned parenthood” and “sandra bland.” The result was a much smaller list, only about 600 words. I put the new list in place and tried again.

[Image: immigration coverage map, colored by state]

Finally, with this new, heavily hand-curated wordlist, significant terms queries started to give us results that made sense and actually conveyed some interesting and useful information.

News outlets in bluer states were noticeably less likely to have “benghazi” among their top terms associated with Hillary Clinton than those in redder states. Both “immigration” and “undocumented” are among the top terms associated with Donald Trump in essentially every state (along with “megyn kelly” in several states), while Jeb Bush isn’t strongly associated with any particular topics across the board – his top terms in most states are either lists of other Republican candidates, generic acknowledgements that he’s running (“presidential,” “2016”), or topics that the state is highly concerned with independent of him (e.g., “immigration” appears in Arizona, Texas, and Florida with or without a Jeb Bush filter).

Chris Christie is mentioned more in New Jersey than in any other state, including New York and California, which both have many more total articles in our dataset. Relative to the total number of articles published in the state, Iowa has a much larger share of articles mentioning each of the presidential candidates by name than any other state at this point.

The topic of “violence” is strongly associated with “gun control” and “second amendment” in New York, with topics like “ferguson,” “naacp,” “police,” and “protest” in much of the South, and with “syria,” “yemen,” and “gaza” in California.

Some of these results are unsurprising, and it’s interesting to see the data confirm our intuition; others – particularly the major differences in focus between what sorts of violence are being covered in different states – reveal something novel.

There were some major trade-offs in creating Media Monitor: because the wordlist had to be examined and altered by hand, we don’t yet have a good sustainable way to keep it updated as new topics become relevant in the news. And because Media Monitor operates off a limited wordlist, users can’t search whatever topics they want (although hopefully we’ve done a pretty good job of making sure the most important topics are included). However, in the end, we’ve managed to put together something that can provide some meaningful insight into discrepancies in political news coverage – and I learned a huge amount in the process.

