Archive for the ‘Data Science’ Category

Attention Web World Cup: Weekend 1 Update

June 16th, 2014 by Dan


Last week was the start of the World Cup, which meant the kickoff of Chartbeat’s Attention Web World Cup. We’re just 11 matches in and we’ve already seen some pretty awesome games. Some of my favorites include: The Netherlands soundly defeating the defending champs; Costa Rica surprising Uruguay with a 3-1 upset; and Switzerland scoring in the 93rd minute to defeat Ecuador.

But, for those of you who were disappointed in the performance of your teams over the weekend, here’s your chance for redemption. Below are the scores for how the teams fared this weekend in the Attention Web World Cup, and they are quite different than the outcome of the “real” World Cup.

Engagement between countries is very similar … this is truly anyone’s cup!

(Winning scores highlighted in green, draws in yellow.)




Wait, how does a draw work in the AWWC?

Many of you will notice that in some games a two-second differential, for example, will result in a win for one of the teams, yet in another game, a two-second differential will result in a draw. Take, for example, the Cote d’Ivoire/Japan matchup. Japan had a median Engaged Time of 26.0 seconds, and Cote d’Ivoire had only 20.0 seconds. A six-second differential, but we had a draw? What’s with that?

As I said in the last post, I determine the winner in a statistical manner. Over the course of the game, I sample Engaged Time for users from each country for the top 20 articles on each of Chartbeat’s sites. This results in a distribution of times for each team. To determine a winner, I ask, statistically, whether these two distributions are different. In other words, I try to determine that if I had a large enough sample of Engaged Times for each country, would it turn out that one country consistently had a larger median Engaged Time? The problem—and this is a fundamental concept in statistics—is that the size of a sample is directly related to the precision with which you can judge your statistic of interest. In our case, this amounts to the fact that the more data we have, the narrower the margin can be for us to determine a winner.
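The test described above can be sketched as a simple permutation test on the difference of medians. This is only an illustration with made-up numbers, not Chartbeat's actual analysis code; the sample data, `n_boot`, and `alpha` are all stand-ins.

```python
import random
import statistics

def median_diff_significant(times_a, times_b, n_boot=2000, alpha=0.05, seed=0):
    """Permutation test: is the gap between the two countries' median
    Engaged Times larger than we'd expect from sampling noise alone?"""
    rng = random.Random(seed)
    observed = abs(statistics.median(times_a) - statistics.median(times_b))
    pooled = list(times_a) + list(times_b)
    n_a = len(times_a)
    count = 0
    for _ in range(n_boot):
        # Reshuffle under the null hypothesis that both countries
        # draw from one shared distribution of Engaged Times.
        rng.shuffle(pooled)
        diff = abs(statistics.median(pooled[:n_a]) - statistics.median(pooled[n_a:]))
        if diff >= observed:
            count += 1
    p_value = count / n_boot
    return p_value < alpha  # True -> a winner; False -> a draw

# With samples this small, even a 6-second gap in medians
# may not reach significance, i.e. the match can end in a draw.
japan = [26.0, 24.0, 30.0, 22.0, 28.0]
cote_divoire = [20.0, 18.0, 25.0, 27.0, 16.0]
print(median_diff_significant(japan, cote_divoire))
```

The key property is the one in the post: the fewer samples a country contributes, the wider the gap in medians has to be before the test declares a winner rather than a draw.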

And here’s the rub: For countries like Cote d’Ivoire and Japan, we didn’t have many samples to look at. With these distributions, there is too much variability in the data for us to precisely determine whether the 26-second median we measured for Japan is, in actuality, truly larger than the 20-second median we measured for Cote d’Ivoire. We just can’t know if Japan had such a large median only because of the particular sample we drew in comparison to Cote d’Ivoire’s sample.

In this way, the Attention Web World Cup is quite democratic. Countries whose web presence across our sites isn't very large don't automatically get relegated to the bottom of the heap; they have a good chance of picking up a point through a draw.

Keep checking back for updates and tweet about your favorites using #AWWC.

Boa Sorte e Divirta-se!

Kicking Off the Attention Web World Cup

June 13th, 2014 by Dan


The World Cup kicked off yesterday with a fantastic (albeit controversial) game between Brazil and Croatia, with Brazil winning 3 - 1. Over the next month, we soccer fans will be glued to our screens—tablets, phones, TVs, you name it. We’ll be watching games, replaying highlights, reading articles, checking stats, tweeting, messaging our friends. The World Cup will consume our lives.

As a soccer fan, I started worrying that this would result in a significant drop in my productivity, which, you know, wouldn’t be that great since we here at Chartbeat are constantly working to deliver fantastic products and provide key data insights. I got to thinking about how many of our customers will be putting out amazing content covering the World Cup, how I will likely spend a lot of time reading this content, and how I’ll be doing my best to ensure that our data will help you deliver your content as effectively as possible. This is when an idea struck me.

Chartbeat should hop on the World Cup bandwagon, and use our data to say something insightful about engagement during the games. Those of you who stare at the Chartbeat Publishing Dashboard all day know that content is consumed differently by visitors from different countries, and as a Chartbeat Data Scientist, I started wondering exactly how engagement with content varies across countries. Could I do an analysis to provide some interesting, useful insights into how users from different locations consumed content? And then, I said to myself: Bah! Let’s just have some fun.

So, Chartbeat would like to introduce the Attention Web World Cup, a friendly international competition to see which country has the most engagement with content. We’ll be pitting each country in the World Cup against each other as they consume content on the web during World Cup matches.

How does a match in the AWWC work?

At the same time that a World Cup match is being played, we sample engagement across all of Chartbeat’s sites, filtered to users from each of the countries playing in that match. About every five minutes during the game, I take the top twenty articles on each of our domains, as judged by the number of concurrent users on each article, and grab the average Engaged Time for each of those articles. This is done for each country separately. At the end of 90 minutes, I calculate a score by taking the median of these Engaged Times. The country with the higher median Engaged Time is the winner.
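The scoring step can be sketched as follows. The numbers are purely hypothetical and `score_match` is an illustrative helper, not part of any Chartbeat tooling.

```python
import statistics

def score_match(samples_by_country):
    """samples_by_country maps a country name to the list of
    per-article average Engaged Times (in seconds) collected during
    the match: roughly every five minutes, the top 20 articles per
    domain by concurrent users each contribute one value.
    Returns each country's score: the median of its samples."""
    return {
        country: statistics.median(times)
        for country, times in samples_by_country.items()
    }

# Hypothetical Engaged Times for illustration only.
scores = score_match({
    "Brazil": [12.0, 34.5, 22.1, 18.0, 40.2],
    "Croatia": [25.0, 31.0, 19.5, 44.0, 28.3],
})
print(scores)  # Croatia's median (28.3) edges Brazil's (22.1)
```

The country with the higher median wins, subject to the statistical check for a draw described in the next paragraph.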

I wanted to be fair about this, so I score matches in a statistical manner, since what we really have is a distribution of times. A draw is possible if the difference between the medians isn’t statistically significant. How did we choose this scoring scheme? Did we have a big meeting here in the Hall of Justice at Chartbeat Studios to decide on the rules of the game? Were there knock-down, drag-out arguments between members of our data science team? No. I just arbitrarily decided this one day last week and stuck with it. So, yeah, there’s that.

So how did the first match of the #AWWC play out?

Match 1: Brazil v. Croatia


And Croatia is the winner! It was quite a close game, but even at the half, Croatia was ahead. Looking at the distribution of Engaged Times shows that engagement was, in fact, quite similar between the two countries. See below: Brazil is in yellow, Croatia is in red.


A larger proportion of Brazilians spent little time engaged with content, while a small percentage of Croatians spent a large amount of time engaged with it. Those readers were the superstars for the Croatian team in the first match of the Attention Web World Cup, and they pushed their team to victory.

We can conjecture as to why this is so: Perhaps fans of the Croatian team were scouring sites on the net to find explanations for why those two calls (the penalty and the no-goal) went the way they did, or perhaps Croatians were re-reading all the articles that said Croatia had absolutely no chance of beating Brazil. Were the Brazilian fans more focused on the game, and less on their “second screens?” Whatever the reason, Croatian fans can find some solace in the fact that, at least in the Attention Web World Cup, they came out ahead.

I’m going to be doing this for every match in the Group stage of the World Cup, followed by the tournament style bracket, so keep on checking our blog over the next month for updates on how your favorite team fares. You can tweet me @dpvalente for further discussion about the scoring scheme or anything else related to data/analysis.

Download the Summer 2014 Quarterly

June 12th, 2014 by Kyle

When we speak with publishers and advertisers, one thing is always very clear to us: Almost as if by their very nature, they're ready with pen and pad to learn everything there is to know about the ever-evolving web. We're right there with 'em. If you follow along on our blog (which, of course, you should do...), then you know that our whip-smart crew of data scientists spends a lot of time sifting through mounds of data and writing about their findings.

You've also probably seen some of their work and gnarly charts in publications such as TIME, Slate, Poynter, The Washington Post, and Adweek. But there's so much more that we want to learn and share with you. So today, we're inaugurating the Chartbeat Quarterly, a once-a-season magazine that dives headlong into the data science of online publishing and advertising. This summer, we're exploring "Matters of Time," an all-time favorite topic of ours.

We're scratching the surface, but download the Quarterly and let us know what you think!



Audience Building on Vulture: A Case Study

April 2nd, 2014 by Josh

Want to know more about traffic sources and how they can help you understand your audience's behavior? Download our guide.

Over the past year, we’ve published extensive research on how to use data to understand and build your audience — everything from the effects of Engaged Time to scrolling behaviors to the traffic sources driving visitors to the sites in our network. All of the data in those pieces comes from a set of customers who allow us to use their data in anonymous, aggregated form. Looking at statistics aggregated across a wide swath of sites is interesting because it lets us identify network-wide facts.

But, subtle patterns often get averaged out, so it’s hard to tell a nuanced story using aggregated data. Today, in partnership with New York Magazine and Rick Edmonds and Sam Kirkland of Poynter, we’re excited to present something different: a deep look into the data for one site, New York Magazine’s Vulture, examining what factors drive visitor loyalty. (A quick note: This data is presented with the consent of New York Magazine, and Chartbeat never shares customer-specific data.)

If you’re going to read one piece, I’d highly encourage you to click over and read the Poynter team's piece, which contains much of the data given below, as well as extensive feedback from the Vulture team. But we also wanted to present our own take on the data, which you’ll find below. Our goal is less to provide answers than to get you thinking about what questions you might ask of your own site.


How We Define “Loyalty” and Why It's Important to Measure

Before we can look at how visitors become loyal to a site, the first thing to do is define loyalty. Informally, by “loyal” we mean something like “a person who is highly likely to continue to return to the site across time.” For instance, a person might be loyal to the site of their daily newspaper. One way of getting toward a specific definition using the data is by asking how many times a person must visit before we’re nearly certain they’ll continue to return. In the figure below, we plot the probability that a person will return to the site, given the number of times they’ve already been to it.

There are perhaps three things worth noting on this plot:

  1. Visitors who have come once so far in a month are just over 20% likely to return.

  2. That rate of return climbs rapidly until we reach visitors who have visited five or six times. Once a person has come five or six times in a month, we can be highly confident that they’ll continue to return.

  3. The downward slope on the right side of the graph is a windowing effect because we’re looking at one month of data: people are unlikely to come every single day in a month, so once a visitor has come more than about 22 times their probability of returning more times begins to decrease.

Based on this, a reasonable definition of a “loyal” visitor is one who visits at least five times in a month — after a person has come five times, we have a strong belief that they’ll continue to come back.
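That return-probability curve can be sketched from a month of per-visitor visit counts. The counts below are invented for illustration, and the helper is not our production code; with this toy data, P(return | 1 visit) comes out around 20% and reaches certainty after a handful of visits, mirroring the shape described above.

```python
from collections import Counter

def return_probabilities(monthly_visit_counts, max_k=10):
    """Given each visitor's total number of visits in a month,
    estimate P(return | k visits so far): of everyone who reached
    k visits, what fraction went on to visit at least once more?"""
    counts = Counter(monthly_visit_counts)
    probs = {}
    for k in range(1, max_k + 1):
        reached_k = sum(n for c, n in counts.items() if c >= k)
        went_past_k = sum(n for c, n in counts.items() if c > k)
        if reached_k:
            probs[k] = went_past_k / reached_k
    return probs

# Invented data: most visitors come once, a loyal tail keeps returning.
visits = [1] * 80 + [2] * 8 + [3] * 4 + [6] * 5 + [12] * 3
# P(return | 1 visit) is 0.2 here; by five visits it reaches 1.0.
print(return_probabilities(visits, max_k=5))
```

Under a definition like the one above, the “loyal” segment is simply the visitors whose monthly count reaches five.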

The Relationship Between Time of Day and Return Rate

After asking if visitors returned to the site, the next question was when visitors returned. One of the most striking data points we found was that visitors are far more likely to return at the same time of day as that of their initial visit — those who first visit the site today at noon are most likely to come back to the site tomorrow at noon, and so on. While that pattern is significant throughout the day, for Vulture it’s substantially stronger for visitors who come in the afternoon and evening, as demonstrated in the figure below.


In this figure, we’re comparing two sets of visitors: those who first arrive on a Wednesday between 10:00 a.m. and 10:59 a.m. and those who arrive on the same day, but between 6:00 p.m. and 6:59 p.m. The red lines show what hours of the day the 10 a.m. visitors return to the site throughout the rest of the month, and the blue lines represent the same statistics for the 6 p.m. visitors. For both audiences, the vast majority of time spent on other days of the week is at the same time of day — for instance, the 10 a.m. audience is most likely to return on Tuesday, Wednesday, or Thursday at about 10 a.m. What’s striking, though, is that the 6 p.m. audience spends dramatically more time on site throughout the week when compared to the 10 a.m. crowd. It’s worth noting that, though we’re showing traffic from Wednesday morning and evening, the basic pattern holds for those who arrive at other hours on other days.

One theory might be that this variation is caused by a difference in the topics consumed — perhaps, for instance, readers are engaging with Vulture's TV coverage during the afternoon and evening. Interestingly, we saw no evidence that this is the case: the breakdown of traffic by topic is roughly constant throughout the day. On the other hand, this variation in return times lines up extraordinarily well with device usage. In the early daytime, when traffic is less likely to return, upwards of 40% of traffic is mobile. In the evening, when traffic is much more predictable and more likely to return, mobile falls to only 22% of overall traffic.
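One way to surface the same-time-of-day pattern from raw visit logs is to group visitors into cohorts by the hour of their first visit and count the hours of their later visits. This is a toy sketch over an invented log, not how the figures above were computed.

```python
from collections import Counter, defaultdict

def return_hours_by_cohort(visit_hours):
    """visit_hours maps a visitor ID to a chronological list of visit
    hours (0-23). Visitors are grouped by the hour of their first
    visit; for each cohort we tally the hours of all later visits."""
    cohorts = defaultdict(Counter)
    for hours in visit_hours.values():
        if len(hours) > 1:
            cohorts[hours[0]].update(hours[1:])
    return cohorts

# Invented log: the 10 a.m. first-timers tend to come back near 10 a.m.
log = {
    "a": [10, 10, 11, 10],
    "b": [10, 9, 10],
    "c": [18, 18, 19, 18, 18],
}
profiles = return_hours_by_cohort(log)
print(profiles[10].most_common(1))  # [(10, 3)]
```

The real analysis also conditions on day of week and weights by time spent, but the cohort-by-first-hour idea is the core of it.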

This data raises more questions than it answers: What can be done to get the morning audience to come back more frequently? How can editors take advantage of the daily patterns of their evening readers? Answering those questions is out of the scope of this article, but the upshot here is that there is a hugely interesting opportunity in understanding behavior as it relates to time of day.

Improving Return Rates of New Visitors

Obviously, one key challenge for any publication is in getting new, incidental visitors to move down the funnel toward loyalty. We saw three factors that exhibited significant influence over a new visitor’s probability of returning: how they arrived at the site, the type of content they landed on, and how much time they spent reading.

Vulture’s top referrers are similar to what we see across the internet, as are their relative rates of return. Unsurprisingly, new visitors coming from its sister site are most likely to return (22%), followed by those from Twitter (16%) and Buzzfeed (10%). Perhaps surprisingly, the length of an article proved to be a strong predictor of likelihood to return, as shown below.


Stepping through this graph from left to right:

  1. Visitors who land on the shortest articles are extremely unlikely to return, but their probability of return rapidly increases from there.

  2. Those who view the Vulture homepage, forming the first peak at about 3900 pixels, are substantially more likely to return than those who view average-length articles, which are 4000-4500 pixels high.

  3. However, those who visit longer articles — this article, for example — are substantially more likely to return.

We see similar trends when we look at the time that a visitor spends reading whatever page they land on.


Visitors who spend substantial time reading on the first page they land on are also much more likely to return to the site. Overall, this confirmed an editorial hunch the Vulture team had, that they were better off moving away from extremely short pieces of content.

But that’s the Vulture team specifically; shorter posts may work best for your site. We dove into this study precisely because every site is different: the content is different, the people visiting are different, the goals and metrics are different. I hope you and your team will see this data as a starting point for everything you can be looking at and acting on. There's a lot more richness to your site's data than purely traffic numbers. If you need help getting started and knowing what to look for — Chartbeat or not — just send me an email at

How Long Are Viewable Impressions Actually Seen?

March 23rd, 2014 by Alex Carusillo

On Friday, Digiday wrote a piece examining some assumptions that are all too often made about the way people read on the internet. It covered a bunch of our favorite stuff, including the finding that conventionally “good” advertising spots aren’t necessarily in the places people actually read. In addition, Lucia introduced something we’ve been thinking about a lot lately: the duration of an impression.

Over the past year, the industry has finally rallied around a viewability metric. As a result, we’ve seen a lot of premium publishers do great work to make their ads more viewable and, in turn, pull way ahead of their lower-quality competitors. Naturally, more and more ads are becoming viewable across the internet, which raises a new question we hear almost daily from publishers: How do I prove that my inventory is actually better than the alternative?

Everyone is looking to sell based on reader attention, but I’ve yet to meet someone who thinks that the viewable impression actually helps them do that -- particularly when it comes to premium sites.

So we’ve been working with publishers, helping them take the next step: understanding how long people are actively focused on content while an ad is on the screen.

There’s a lot of research out there showing that the more time people spend with an ad, the more likely that ad is to succeed, but this research rarely looks at how long real ads on the internet are actually seen.

We decided to find out.

We took a look across a select group of publishers to find out how long ads are seen when they’re seen.  

Turns out, half of all viewable ads are seen for 1-5 seconds while the other half are, obviously, seen for longer than that. The natural reaction is to look at each number of seconds and sort of consider them “engagement points” and just assume that a higher number means better ads, but it’s actually not that black and white.

Research has shown over and over that at about ten seconds of exposure diminishing returns start to set in and each additional second is worth less in terms of recall than the previous ones; that doesn’t mean, however, that every ad should seek to be seen for ten seconds. It means that different ads are right for different goals.

If an advertiser is trying to get their name out there as efficiently as possible in a pure awareness play, they likely want to be buying a spot where shorter impression duration occurs. Whereas if an advertiser is trying to get a specific message out there or telling a more complex story than just their logo appearing in an ad, they should look for one in the 6-15 second range.

This, of course, leaves a big chunk of impressions that run longer than that ideal engagement time and don’t really help advertisers any more than 15-second ones. We’re working with a handful of people to see what kind of creative things they can do to solve that, getting the most out of their inventory and giving their audience fresh content that benefits everyone involved -- the publisher’s business, the brand’s goals, the reader’s interest.
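The duration bands discussed in the last few paragraphs can be made concrete with a small helper. The 5- and 15-second thresholds follow the rough ranges mentioned above; the durations themselves are invented, and this is an illustration rather than how we segment inventory in practice.

```python
def bucket_impressions(durations):
    """Group viewable-impression durations (in seconds) into the rough
    bands discussed above: short awareness exposures (1-5s), the
    6-15s range suited to more complex messages, and longer views
    past the point of diminishing returns on recall."""
    buckets = {"1-5s": 0, "6-15s": 0, ">15s": 0}
    for d in durations:
        if d <= 5:
            buckets["1-5s"] += 1
        elif d <= 15:
            buckets["6-15s"] += 1
        else:
            buckets[">15s"] += 1
    return buckets

# Invented impression durations for illustration.
print(bucket_impressions([2, 4, 5, 7, 12, 18, 30]))
# {'1-5s': 3, '6-15s': 2, '>15s': 2}
```

A pure awareness campaign might be happy with a heavy first bucket, while a message-driven campaign would look for inventory whose distribution leans toward the middle band.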

The point is, not every viewable impression is equal but that doesn’t mean that the shorter ones are categorically worse. It means that we should think about the goals of a campaign and which impressions are the right way to achieve these goals.