If you had to describe five important events that were happening in the world right now, what would they be? How would you even go about answering that question?

To start, you might visit the homepage of your favorite news site, aggregator, or publisher. But just one site won’t have everything you’re looking for — maybe you want different takes on today’s news. What you might do is collate articles across several sites, see which news events multiple publishers are reporting on, and look at different perspectives on each story.

For our Hackweek project, backend engineer Anastasis Germanidis and I developed a process to identify these trending, important, global news events automatically and in real-time, using publicly available data. With a few machine learning algorithms, we can group articles across different sites by news event and output a list of important news events being reported right now, each represented by a set of articles providing different angles on the story.

I’ll first show our results, and then talk about the data science that makes this work. Below, I’ve run our data science pipeline on the home pages of major U.S. publishers, including the New York Times, the Washington Post, and the Wall Street Journal, scraping data from the afternoon of October 13. To be clear, this pipeline does not use any data from Chartbeat’s analytics products – everything we use comes from a web scraper, which sees what any reader on the web would see.

Our project captures the important events of the day through algorithms and provides multiple articles for each news story.


Results: October 13, 2015

News Event 1: Violence in Israel


News Event 2: Kansas City Fire


News Event 3: Democratic Debate


So How Does It Work?

First, we need a dataset of articles to work with. We start by using PhantomJS, an open-source headless browser, to scrape the homepages of several major U.S. publishers, including the New York Times, the Washington Post, and the Wall Street Journal. We want articles that homepage editors think are important to today’s news, so for each page, we look at all article links above the fold on a desktop screen and pick the top ten articles by link size.
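To make that selection step concrete, here is a minimal Python sketch of the heuristic, assuming the scraper has already emitted each link’s URL, vertical position, and rendered size. The field names and the fold height below are illustrative assumptions, not the project’s exact values.

```python
FOLD_PX = 1000  # rough height of a desktop viewport (assumed, not the project's value)

def top_articles(links, n=10):
    """Pick the n most prominent above-the-fold article links."""
    above_fold = [link for link in links if link["top"] < FOLD_PX]
    # Use the rendered area of each link as a proxy for editorial prominence.
    ranked = sorted(above_fold,
                    key=lambda link: link["width"] * link["height"],
                    reverse=True)
    return [link["href"] for link in ranked[:n]]
```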

We feed our article links to Python-Goose, a Python library which extracts the content of an article given its URL. Now we have the title, description, and content of ten articles on each homepage we started with.
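A minimal sketch of this extraction step with python-goose might look like the following; the URL is just a placeholder.

```python
from goose import Goose  # python-goose

g = Goose()
article = g.extract(url="http://example.com/some-news-story")  # placeholder URL

title = article.title                    # headline
description = article.meta_description   # summary from the page's metadata
content = article.cleaned_text           # article body with markup stripped
```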

We want to organize our dataset of scraped articles into news events. We start by preprocessing our article text with two steps: 1) named entity extraction and 2) tf-idf vectorization. Let me explain:

Named entity extraction

This involves identifying words or phrases that correspond to names of things. We use the MITIE Python library, which identifies the names of people, organizations, and locations and classifies each entity it finds into one of these three categories. For our purposes, we’re less concerned with the classification of each named entity than with the identification of these words and phrases. We extract all instances of named entities in each article to use for the next step of our pipeline.
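Here is a rough sketch of that extraction step with MITIE; the path to the pre-trained English NER model is an assumption about the local setup.

```python
from mitie import named_entity_extractor, tokenize

# Load MITIE's pre-trained English model (the path is an assumption).
ner = named_entity_extractor("MITIE-models/english/ner_model.dat")

def extract_entities(text):
    """Return the named entities found in a piece of article text, as strings."""
    tokens = tokenize(text)
    entities = []
    for e in ner.extract_entities(tokens):
        token_range, tag = e[0], e[1]  # tag is e.g. PERSON, ORGANIZATION, LOCATION
        entities.append(" ".join(tokens[i] for i in token_range))
    return entities
```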

Because news events can almost always be uniquely identified by the names of the people, organizations, and locations involved, named entity extraction is an effective way of filtering out relatively unimportant terms while retaining important information — think of it as an extension of stop-word removal.

tf-idf Vectorizer

Scikit-learn’s tf-idf vectorizer transforms our list of named entities into a numerical vector for each article, which allows us to cluster articles with standard clustering algorithms. tf-idf stands for term frequency-inverse document frequency. In this case, term frequency is the number of times a named entity appears in an article divided by the total number of entities in the article. Document frequency is the fraction of articles in our dataset that contain a particular named entity. For a given entity and article, the tf-idf statistic is simply the term frequency divided by the document frequency.

Roughly speaking, tf-idf gives a higher weight to entities that appear frequently in the article but less frequently in other articles.

Each dimension of an article’s tf-idf vector represents the tf-idf statistic for a particular word in our vocabulary. In this pipeline, our vocabulary contains all entities that have appeared at least once in our article dataset.
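As a sketch of this step, assuming article_entities holds one list of extracted entity strings per article: multi-word entities are joined with underscores so each stays a single vocabulary term. Note that scikit-learn’s vectorizer applies a smoothed, logarithmic inverse document frequency by default, a slightly different weighting than the simplified definition above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def entities_to_doc(entities):
    # "Hillary Clinton" -> "Hillary_Clinton" so it remains a single term.
    return " ".join(entity.replace(" ", "_") for entity in entities)

docs = [entities_to_doc(entities) for entities in article_entities]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)   # sparse (n_articles, n_entities) matrix
```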

We cluster our tf-idf article vectors using an algorithm called spectral clustering, again using scikit-learn. Spectral clustering consists of three steps: first, we use the pairwise similarity of tf-idf vectors to construct a similarity matrix of our articles; next, we perform dimensionality reduction on this matrix using an eigenvalue decomposition; finally, we run the k-means algorithm on the resulting low-dimensional representation to obtain our article clusters. We’ve found that for a dataset with 60 articles from six publishers, clustering into seven or eight groups works well.
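Continuing the sketch above, the clustering step might look like this, with the cosine-similarity matrix handed to scikit-learn as a precomputed affinity and n_clusters set to the seven or eight groups mentioned above.

```python
from sklearn.cluster import SpectralClustering
from sklearn.metrics.pairwise import cosine_similarity

similarity = cosine_similarity(tfidf)        # pairwise article-to-article similarities

clusterer = SpectralClustering(n_clusters=8, affinity="precomputed", random_state=0)
labels = clusterer.fit_predict(similarity)   # one cluster id per article
```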

Why didn’t we use a probabilistic topic model such as Latent Dirichlet Allocation? We found that topic models such as LDA give you clusters that roughly correspond to sections, such as technology, science, and politics, and not individual news events. This is perhaps because these algorithms allow for an article to belong to multiple topics instead of forcing a hard classification. This doesn’t make sense if topics are to correspond to news events – we know that an article will rarely report on more than one news story.

Here’s a diagram of our full pipeline.

What’s Next?

Recently, Twitter released a product called Moments, which organizes tweets into events using a team of human curators. We want to use our automated process to do the same with news articles, and we’re working towards a web application that displays our news events in real-time.

By using algorithms to evaluate the importance of news stories, we give you an easy way to figure out what’s happening in the world right now — without having to organize articles yourself or even wait for human curators.

It’s no surprise that A/B testing headlines is all the rage these days—A/B testing is a proven method of increasing conversions and teaches editors how to write better headlines. But because stories don’t live forever, it’s good to ask a few questions when trying to decide if headline testing is right for you.

We’d all love it if we could instantly know if one headline does better than another at increasing conversions. But the reality is that an A/B test, just like any other experiment, needs to run long enough in order to give us reliable results—sometimes longer than headlines are relevant.

In this blog post, we’ll look at what it means for an experiment to be reliable, and then at what we can do to make A/B tests both quick and accurate.

Any reliable experiment needs to have these two properties:

  1. Statistical significance: if our headlines actually perform about the same, there is only a low probability that we mistake statistical noise for a real difference in conversions.
  2. Statistical power: if there really is a big difference in headline conversions, there is a high probability that our test detects it.

Suppose a headline has a 5% click-through rate (CTR) on a landing page that gets 694 visitors per hour. If this pageview rate is constant, this equates to about half a million monthly pageviews. The hypothesis we want to test is whether a new headline that we wrote performs at least 50% better or worse than our baseline—that is, if it has a CTR of less than 2.5% or greater than 7.5%.

We’ll design an experiment that has 5% statistical significance and 90% power. If our hypothesis is false, we’ll give a false positive only 5% of the time, which means that 95% of the time we’ll have the right answer. If our hypothesis is true, our experiment gets things right 90% of the time.

In the scenario above, we would have to run our A/B test for about 5 hours before we get enough trials to end our experiment. [1]
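To show where a figure like this comes from, here is a small Python sketch of the sample-size arithmetic. It uses the normal-approximation formula for comparing two proportions, the same style of calculation as the calculator in [1], and assumes the hourly traffic is split evenly between the two headlines.

```python
from scipy.stats import norm

def hours_to_result(base_ctr, relative_change, visitors_per_hour,
                    alpha=0.05, power=0.90):
    """Hours of traffic needed to detect the given relative change in CTR."""
    delta = base_ctr * relative_change
    new_ctr = base_ctr + delta
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided 5% significance
    z_power = norm.ppf(power)           # 90% statistical power
    n_per_variant = (z_alpha * (2 * base_ctr * (1 - base_ctr)) ** 0.5 +
                     z_power * (base_ctr * (1 - base_ctr) +
                                new_ctr * (1 - new_ctr)) ** 0.5) ** 2 / delta ** 2
    # Traffic is split between the two headlines, so we need 2n visitors in total.
    return 2 * n_per_variant / visitors_per_hour

print(hours_to_result(0.05, 0.50, 694))   # baseline scenario: about 5 hours
```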

Three factors influence how quickly an experiment can achieve statistical significance and statistical power:

1. Landing page traffic.

The more visitors your landing page gets, the faster you accumulate trials for your experiment. The time needed to complete an A/B test is inversely proportional to your traffic.

For instance, if our landing page gets 1,388 visitors per hour (one million monthly pageviews) instead of 694, we reduce the amount of time needed to two hours and 30 minutes.

2. Base click-through rate.

Headlines that appear prominently above the fold tend to have higher click-through rates. A higher base CTR means more clicks per visitor, so we converge to precise estimates of click-through rates more quickly.

Suppose we change our baseline CTR from 5% to 10%. Keeping our 1,388 visitors per hour, our experiment time decreases again to about one hour and 11 minutes.

3. Difference in headline performance.

If two headlines perform similarly, it’ll take more trials to be sure that the small differences we’re observing aren’t just noise.

Suppose we think that our new headline is going to be either much better or much worse than our baseline. We modify our hypothesis to ask whether a new headline that we wrote performs at least 75% better or worse than our baseline. Keeping our 1,388 visitors per hour and our baseline CTR of 10%, we see that our experiment time decreases by half yet again, to 32 minutes.
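Using the hours_to_result sketch from earlier, the three scenarios above play out roughly like this:

```python
print(hours_to_result(0.05, 0.50, 1388))   # double the traffic: ~2.5 hours
print(hours_to_result(0.10, 0.50, 1388))   # 10% base CTR: ~1.2 hours (about 1h11m)
print(hours_to_result(0.10, 0.75, 1388))   # bigger expected difference: ~0.5 hours (about 32 min)
```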

What does this mean for me?

  1. If your landing page gets more than a million pageviews a month, you’ll be able to reliably A/B test your top headlines before your stories get old – you’ll most likely get results within a couple of hours. With more traffic, you’ll be able to test less prominent headlines on your page and reduce the amount of time needed for your A/B tests.
  2. If your site gets less than a million pageviews a month, there’s still hope! Because click-through rates and the differences between your headlines are also major factors in determining the speed of A/B tests, A/B testing might still work for you.
  3. On a typical landing page, your top-performing headlines above the fold might have a 5-7% click-through rate. As you scroll down the page, headline CTRs tend to drop below 1% for the average link. Unless you have a massive amount of traffic (more than 10 million monthly pageviews) or have two headlines that are drastically different from each other, you’re probably going to wait more than a few hours before you get results on headlines below the fold.

References

1. http://www.evanmiller.org/ab-testing/sample-size.html

Note: An earlier version of the post listed incorrect A/B testing times. Those times corresponded to different levels of statistical significance and power than the ones described in this post. The times have been corrected to reflect a statistical significance of 5% and a statistical power of 90%.