Identifying and Clustering News Events by Content

If you had to describe five important events that were happening in the world right now, what would they be? How would you even go about answering that question?

To start, you might visit the homepage of your favorite news site, aggregator, or publisher. But just one site won’t have everything you’re looking for — maybe you want different takes on today’s news. What you might do is collate articles across several sites, see which news events multiple publishers are reporting on, and look at different perspectives on each story.

For our Hackweek project, backend engineer Anastasis Germanidis and I developed a process to identify these trending, important, global news events automatically and in real-time, using publicly available data. With a few machine learning algorithms, we can group articles across different sites by news event and output a list of important news events being reported right now, each represented by a set of articles providing different angles on the story.

I’ll first show our results, and then talk about the data science that makes this work. Below, I’ve run our data science pipeline on the homepages of major U.S. publishers, including the New York Times, the Washington Post, and the Wall Street Journal, scraping data from the afternoon of October 13. To be clear, this pipeline does not use any data from Chartbeat’s analytics products – everything we use comes from a web scraper, which sees what any reader on the web would see.

Our project captures the important events of the day algorithmically and provides multiple articles for each news story.


Results: October 13, 2015

News Event 1: Violence in Israel


News Event 2: Kansas City Fire


News Event 3: Democratic Debate


So How Does it Work?

First, we need a dataset of articles to work with. We start by using PhantomJS, an open-source headless browser, to scrape the homepages of several major U.S. publishers, including the New York Times, the Washington Post, and the Wall Street Journal. We want articles that homepage editors think are important to today’s news, so for each page, we look at all article links above the fold on a desktop screen and pick the top ten articles by link size.
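As a rough illustration, here’s what that selection step might look like in Python once the PhantomJS scrape has produced one record per homepage link with its rendered position and size. The field names and fold height below are illustrative assumptions, not the scraper’s actual output format.

```python
FOLD_HEIGHT_PX = 900  # rough height of a desktop viewport "above the fold"


def top_articles(links, n=10):
    """Return the n largest article links that start above the fold.

    `links` is assumed to be a list of dicts like:
        {"href": "https://...", "top": 120, "width": 400, "height": 60}
    """
    above_fold = [link for link in links if link["top"] < FOLD_HEIGHT_PX]
    by_size = sorted(above_fold, key=lambda l: l["width"] * l["height"], reverse=True)
    return [link["href"] for link in by_size[:n]]
```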

We feed our article links to Python-Goose, a Python library that extracts the content of an article given its URL. Now we have the title, description, and content of ten articles from each homepage we started with.
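That extraction step is only a few lines with python-goose. The sketch below assumes `article_urls` is the list of links collected in the previous step.

```python
from goose import Goose  # python-goose

goose = Goose()


def extract_article(url):
    """Fetch a URL and return the fields we keep for clustering."""
    article = goose.extract(url=url)
    return {
        "url": url,
        "title": article.title,
        "description": article.meta_description,
        "content": article.cleaned_text,
    }


# articles = [extract_article(url) for url in article_urls]
```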

We want to organize our dataset of scraped articles into news events. We start by preprocessing our article text with two steps: 1) named entity extraction and 2) tf-idf vectorization. Let me explain:

Named entity extraction

This involves identifying words or phrases that correspond to names of things. We use the MITIE Python library, which identifies the names of people, organizations, and locations, classifying each entity it finds into one of these three categories. For our purposes, we’re less concerned with the classification of each named entity than with identifying these words and phrases. We extract all instances of named entities in each article to use for the next step of our pipeline.
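Here’s a minimal sketch of that step using MITIE’s Python bindings. The model path is wherever you’ve unpacked MITIE’s pretrained English NER model, and the `extract_entities` helper is our own wrapper, not part of MITIE itself.

```python
from mitie import named_entity_extractor, tokenize

# Pretrained English NER model shipped with MITIE -- adjust the path as needed.
ner = named_entity_extractor("MITIE-models/english/ner_model.dat")


def extract_entities(text):
    """Return the named-entity strings found in an article's text."""
    tokens = tokenize(text)
    entities = []
    for entity in ner.extract_entities(tokens):
        token_range, tag = entity[0], entity[1]  # tag is PERSON, ORGANIZATION, LOCATION, ...
        entities.append(" ".join(tokens[i] for i in token_range))
    return entities
```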

Because news events almost always can be uniquely identified by the names of people, organizations, and locations involved, named entity extraction is an effective way of filtering out relatively unimportant terms while retaining important information — think of it as an extension of stop-word removal.

tf-idf Vectorizer

Scikit-learn’s tf-idf vectorizer transforms our list of named entities into a numerical vector for each article, which allows us to cluster articles with standard clustering algorithms. tf-idf stands for term frequency-inverse document frequency. In this case, term frequency is the number of times a named entity appears in an article divided by the total number of entities in the article. Document frequency is the fraction of articles in our dataset that contain a particular named entity. For a given entity and article, the tf-idf weight is the term frequency multiplied by the inverse of the document frequency (scikit-learn actually applies a logarithm and smoothing to the inverse document frequency, but the intuition is the same).

Roughly speaking, tf-idf gives a higher weight to entities that appear frequently in the article but less frequently in other articles.
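To pick some illustrative numbers: if an entity like “Kansas City” makes up 5% of the entities in one article (tf = 0.05) but appears in only 20% of the articles in the dataset (df = 0.2), its weight in that article is 0.05 / 0.2 = 0.25, a five-fold boost over its raw term frequency, while an entity that appears in nearly every article has a document frequency close to 1 and gets essentially no boost.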

Each dimension of an article’s tf-idf vector represents the tf-idf statistic for a particular entity in our vocabulary. In this pipeline, our vocabulary contains all entities that have appeared at least once in our article dataset.
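In code, this amounts to a few lines with scikit-learn’s TfidfVectorizer. The sketch below assumes the `articles` and `extract_entities` names from the earlier snippets, and passes a pass-through analyzer so that multi-word entities such as “New York” stay intact as single terms.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Each "document" is the list of named entities extracted from one article.
entity_lists = [extract_entities(article["content"]) for article in articles]

# A callable analyzer receives each raw document and returns its terms as-is,
# so every entity string becomes one vocabulary term.
vectorizer = TfidfVectorizer(analyzer=lambda entities: entities)
tfidf_matrix = vectorizer.fit_transform(entity_lists)  # shape: (n_articles, n_entities)
```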

We cluster our tf-idf article vectors using an algorithm called spectral clustering, again using scikit-learn. Spectral clustering consists of three steps: first, we compute the similarity between each pair of articles’ tf-idf vectors to construct a similarity matrix of our data; next, we perform dimensionality reduction on this matrix using an eigenvalue decomposition of the associated graph Laplacian; finally, we run the k-means algorithm on the resulting low-dimensional representation to obtain our article clusters. We’ve found that for a dataset with 60 articles from six publishers, clustering into seven or eight groups works well.
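Here’s a sketch of the clustering step, again reusing the `tfidf_matrix` and `articles` names from the snippets above. It uses pairwise cosine similarities as a precomputed affinity matrix, which is one reasonable choice rather than necessarily the exact configuration we ran.

```python
from sklearn.cluster import SpectralClustering
from sklearn.metrics.pairwise import cosine_similarity

# Pairwise cosine similarities between tf-idf vectors serve as the affinity matrix.
similarity = cosine_similarity(tfidf_matrix)

clustering = SpectralClustering(n_clusters=8, affinity="precomputed")
labels = clustering.fit_predict(similarity)  # labels[i] is article i's news-event cluster

# Inspect the resulting news events by listing the headlines in each cluster.
for cluster_id in sorted(set(labels)):
    headlines = [articles[i]["title"] for i, label in enumerate(labels) if label == cluster_id]
    print(cluster_id, headlines)
```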

Why didn’t we use a probabilistic topic model such as Latent Dirichlet Allocation? We found that topic models such as LDA give you clusters that roughly correspond to sections, such as technology, science, and politics, and not individual news events. This is perhaps because these algorithms allow for an article to belong to multiple topics instead of forcing a hard classification. This doesn’t make sense if topics are to correspond to news events – we know that an article will rarely report on more than one news story.

Here’s a diagram of our full pipeline:

[Diagram: homepage scraping (PhantomJS) → article extraction (Python-Goose) → named entity extraction (MITIE) → tf-idf vectorization → spectral clustering → news event clusters]

What’s Next?

Recently, Twitter released a product called Moments, which organizes tweets into events using a team of human curators. We want to use our automated process to do the same with news articles, and we’re working towards a web application that displays our news events in real-time.

By using algorithms to evaluate the importance of news stories, we give you an easy way to figure out what’s happening in the world right now — without having to organize articles yourself or even wait for human curators.

