Archive for the ‘Data Science’ Category

Attention Web World Cup Wrap-Up: Sample Size and Variability

July 17th, 2014 by Dan

After a month of exciting matches, the Attention Web World Cup has come to a close. In a time-honored tradition (pun intended), Ghana defeated the US with a score of 30 to 25. Congratulations to everyone from Ghana who was consuming content on the web during World Cup matches; you all contributed to this amazing achievement! And to my fellow Americans: next time around, let’s spend more time reading, okay?

To wrap up the festivities, one of our designers made these awesome animations of the time course of each tournament game based on the data I pulled. These plots show the median Engaged Time for users from each country as each match progresses.

When you view these animations, you’ll likely notice that some of these countries have incredibly stable Engaged Times while others have Engaged Times that are incredibly erratic. The U.S., for instance, shows very little variance in median Engaged Time, while Côte d’Ivoire and Cameroon have median Engaged Times that jump all over the place.

This behavior is a consequence of sample size. At any particular time during a match, users from many of the African countries and other smaller countries made up a much smaller sample than, say, users from the US or Australia. In statistics and data analysis, we’re always concerned about sample size for exactly the reason illustrated in many of these graphs: the variability in a sampled statistic can mask its “true” value. We can try to capture this variability with a distribution, but if that distribution is wide, then we can’t be very confident in the value of whatever measure of central tendency we choose (mean, median, mode, etc.). And the variance of a sample statistic shrinks roughly in proportion to the inverse of the sample size, so only as the number of points we’ve sampled gets large do we have any hope that the confidence in our estimate will rise.
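
To see why this matters so much, here’s a minimal simulation of the effect. It’s a sketch using a made-up Engaged Time distribution (an exponential with a 30-second mean), not our actual data or pipeline:

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in "population" of per-user Engaged Times, in seconds.
# (Hypothetical: an exponential with a 30-second mean, not our real data.)
population = rng.exponential(scale=30.0, size=100_000)

for n in (10, 100, 1_000, 10_000):
    # Repeatedly draw a sample of size n and record its median.
    medians = [np.median(rng.choice(population, size=n)) for _ in range(2_000)]
    print(f"n = {n:>6}: std of the sample median = {np.std(medians):5.2f} s")
```

With only 10 users, the sample median jumps around by several seconds from draw to draw; with 10,000 it barely moves. That stability gap is exactly what the animations above show.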

I’m actually quite surprised the U.S. made it so far in my scoring scheme here. I knew going into the #AWWC that some countries were sorely underrepresented in our sample, and I expected a fair chance that those countries would show a falsely high median Engaged Time: if enough of a small sample of users just so happened to be long-engagement users, that would skew the results. In the Group Round this was okay, because I performed a statistical test that tried to account for this variability. There, I asked a very common statistical question: assuming these two teams actually have the same median Engaged Time, what is the probability that I’d observe a difference in medians at least as extreme as the one I’ve observed? If that probability was low enough, I declared Team A and Team B to have different medians and took the higher one as the winner. But in the bracket round we needed clear winners (no draws were allowed), so we left it up to sampling variance. For the small-sample-size teams, this was a double-edged sword. They only needed a few users spending an inordinate amount of time engaged with content to edge above the higher-sample-size teams; but, conversely, if the users they did have spent very short times engaged, that would skew them toward losing. We can see, though, that this seemed to work out well for these countries: they made a great showing all the way through the AWWC.
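
For reference, one common way to answer that kind of question is a permutation test on the difference in medians. Here’s a minimal sketch of the idea; it’s illustrative only, not the exact code behind the group-stage results:

```python
import numpy as np

def median_diff_p_value(a, b, n_permutations=10_000, seed=0):
    """Permutation test: probability of a median difference at least as
    extreme as the observed one, assuming a and b share a distribution."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    observed = abs(np.median(a) - np.median(b))
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # relabel users at random under the null
        diff = abs(np.median(pooled[:len(a)]) - np.median(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    return hits / n_permutations

# Toy example: two "teams" with made-up Engaged Times (seconds).
rng = np.random.default_rng(1)
team_a = rng.exponential(30, size=2_000)   # large sample
team_b = rng.exponential(25, size=40)      # small sample
print(median_diff_p_value(team_a, team_b))
```

If the resulting probability falls below a chosen threshold, you declare the medians different and award the match to the team with the higher one; otherwise, you call it a draw.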

Thinking about variability is my job, so I might be biased here (yes, a statistics pun), but I hope you enjoyed this fun exploration of our data. I hope it got you thinking about international variability in engagement, and variability of metrics in general. Tweet me @dpvalente or email me at dan@chartbeat if you want to continue the discussion.

Revisiting Return Rates

July 14th, 2014 by Josh

Starting today, we’ve updated our definition of return rate in both our Weekly Perspectives and in the Chartbeat Publishing dashboard. Consequently, you’re likely to see a shift in the numbers in your dashboard — so we wanted to write a quick note explaining the change, why we made it, and what you can expect to see.

Defining return rate

Return rate, if you’re not familiar with it, is a metric designed to capture the quality of traffic that typically comes from a referrer. It measures the fraction of visitors coming from a given referrer who return to a site later — if 1,000 people come to a site from, say, Facebook, should we expect 10 of them to come back or 500? Depending on the answer, we might interpret and respond to a spike from Facebook quite differently.

While the intuition behind return rate is straightforward, the actual formula used to calculate it is a bit more up for grabs. Up until now, we’ve calculated return rates using the following formula:

return rate = (visits from the referrer that are later followed by a return) / (total visits from the referrer)

That formula roughly captures a notion of “how likely is it, for a given visit from Facebook, that that visit will be ‘converted’ into a return?”

As we’ve talked through that definition over the past year, we’ve come to realize that it’s more natural to phrase returns in terms of people, not visits — to ask “how likely is it, for a given visitor from Facebook, that that person will be ‘converted’ into a return?” Hence, we’re now using the following calculation:

return rate = (visitors from the referrer who later return) / (total visitors from the referrer)

So, rather than speaking in units of “visits,” this definition speaks in units of “visitors” — a seemingly small (but significant) change. In addition, we’re now only counting a return if it happens at least an hour after the initial entrance, which corrects for a pattern we sometimes see where visitors enter a site and then re-enter a few minutes later.
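
For the concretely minded, here’s a rough sketch of the visitor-based calculation in pandas. The table layout and the column names ('visitor_id', 'referrer', 'entry_time') are illustrative assumptions, not our production schema or pipeline:

```python
import pandas as pd

def return_rate(visits: pd.DataFrame, referrer: str) -> float:
    """Fraction of visitors who entered from `referrer` and came back
    at least one hour after that first entrance.

    `visits` has one row per visit, with columns 'visitor_id',
    'referrer', and 'entry_time' (datetime). Names are illustrative.
    """
    entrances = visits[visits["referrer"] == referrer]
    first_entry = (entrances.groupby("visitor_id")["entry_time"]
                            .min()
                            .rename("first_entry")
                            .reset_index())

    # Join every visit by those visitors against their first entrance
    # from this referrer, then keep visits at least an hour later.
    later = visits.merge(first_entry, on="visitor_id")
    returned = later[later["entry_time"] >= later["first_entry"] + pd.Timedelta(hours=1)]

    return returned["visitor_id"].nunique() / first_entry["visitor_id"].nunique()
```

A real implementation also has to pick a window for how far forward to look for the return, but this captures the shape of the metric: count people rather than visits, and ignore re-entries within the first hour.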

What's changing?

It’s likely that the return rate numbers in your dashboard and Weekly Perspectives will drop under this new definition. To help you sort out whether your numbers are trending up or down, we’ve recalculated past reports using the new methodology, back to the beginning of June. We hope the transition to the new definition is painless, but if you have any questions, feel free to comment or get in touch with me at josh@chartbeat.com.

Attention Web World Cup: Follow Along with Our Bracket for the Round of 16

June 27th, 2014 by Dan


After two weeks of intense international web engagement, our bracket for the Attention Web World Cup is set. Many of the groups came down to the very last game, and if you’ve been following along, you witnessed the excitement of Honduras narrowly edging out Switzerland by one second and the USA keeping their two-second halftime lead to defeat Germany and advance to the knockout stage of the Attention Web World Cup.

The first round looks to have some really exciting matchups, if the scores from the Group Round are any indicator. Nigeria, however, appears to be the clear favorite going into the next round.

There will be a few small rule changes for the Attention Web World Cup from here on out. First, no draws will be allowed, so we’re throwing statistical significance out the door and simply awarding each match to the team with the higher median Engaged Time. Second, although the teams in the #AWWC differ from those in the real World Cup, each match will still be played at the same time as the corresponding real-world match.

Keep checking back for bracket updates and blog posts. Boa sorte e divirta-se! (Good luck and have fun!)

Attention Web World Cup: The Final Week of the Group Stage

June 23rd, 2014 by Dan


If you're anything like me, you're still in a zombie-like stupor over the last 30 seconds of Sunday night's USA-Portugal game. It's astounding how drastically hopes, dreams, and expectations can change in a mere half minute.

Wait, we're talking about the Attention Web World Cup, right? Well, in that game, the U.S. triumphed, defeating Portugal 25 to 22 and knocking Portugal out of contention for the next round.

For those of you anxiously anticipating which teams are moving on in the AWWC, or how your favorite country is faring, you can see all match results and standings in this sheet.

If you look at the results of each match, you'll likely notice a few interesting points. First, engagement for any particular country seems fairly consistent, although with only two sample points per team we can't really call that a solid conclusion. More interestingly, the vast majority of countries have had very similar engagement times during their respective World Cup games—the exception being Nigeria, who has been totally smoking their opponents. (Their time differential through two matches is a whopping +16.8 seconds.)

Going into the AWWC, I wasn’t entirely sure whether we’d see any drastic country-by-country differences in engagement. That’s part of why I thought the Attention Web World Cup would be pretty interesting to play out: any team could win any given match. We’d be at the whim of sampling variance, and we’d hopefully gain some insight into the international variability of Engaged Times.

There may not be any large, noticeable differences across countries, and there are likely a lot of not-very-interesting reasons for this. In fact, I’m not entirely sure this is a good question to ask. I mean, why would there be drastic differences, right? But in discussions with fellow Data Science team members here at Chartbeat, we wondered whether any differences in how users engage with content might show up more clearly if we grouped by the language of the website rather than by the country from which the user accesses a site. After all, we live in a global society, and perhaps people tend to gravitate toward content written in their primary language.

To get a feel for this, I looked at the Engaged Times I gathered for all games so far (as of Monday morning), but broken out by the language of the website. Below is what these distributions look like. The data are sorted by median Engaged Time.

[Figure: box plot of Engaged Time distributions by website language]

A note about this plot:

This is a type of statistical visualization known as a box plot. (Disclaimer: I’m a big fan of box plots.) A box plot gives us a feel for the distribution of the data. The left end of each box marks the 25th percentile of the data, the right end marks the 75th percentile, and the heavy black vertical line marks the median. For example, for Chinese-language websites, the 25th percentile is 14 seconds, the median is 33 seconds, and the 75th percentile is 41 seconds. The thin lines at either end of the box—known as the whiskers—extend to the minimum and maximum data points measured.
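
If you’d like to build this kind of chart yourself, here’s a minimal matplotlib sketch of a horizontal box plot of Engaged Time grouped by site language. The data here is randomly generated stand-in data, not the AWWC sample:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)

# Made-up Engaged Time samples (seconds) per site language.
engaged_times = {
    "Arabic": rng.exponential(18, size=300),
    "English": rng.exponential(26, size=3_000),
    "Chinese": rng.exponential(33, size=150),
}

# Sort groups by median Engaged Time, like the plot above.
languages = sorted(engaged_times, key=lambda lang: np.median(engaged_times[lang]))
data = [engaged_times[lang] for lang in languages]

fig, ax = plt.subplots()
# whis=(0, 100) extends the whiskers all the way to the min and max.
ax.boxplot(data, vert=False, labels=languages, whis=(0, 100))
ax.set_xlabel("Engaged Time (seconds)")
plt.show()
```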

We do see slight differences in engagement by language, but they are not overwhelming; the distributions overlap quite a bit. That said, there is a 15-second gap between the largest median Engaged Time (Chinese-language sites) and the smallest (Arabic-language sites). Western European languages tend to have nearly identical engagement. If there is any pattern, content written in Eastern languages like Chinese and Japanese does appear to garner more engagement than content in Western languages, but I'm no expert on languages, so I’m hesitant to conjecture further. And, let’s be honest here, I haven't done a rigorous analysis—it has so far just been a fun exploration.

On Friday, we’ll be announcing our lineups for the knockout stage. As always, I’d love to hear your thoughts, so feel free to tweet them @dpvalente using the hashtag #AWWC.

On Engagement & Viewability: Why Quality Content Makes Good Business Sense

June 19th, 2014 by Josh

On March 31, the Media Rating Council (MRC) announced it was lifting its advisory on viewable impressions for display advertising, bringing the industry one step closer to transacting on viewability for the first time. The point at which publishers are asked to deliver highly viewable campaigns is rapidly approaching. If you haven’t started to develop a strategy to maximize the viewability of your ads, I’d wager that in the next three months, you will.

There are many tactics that can be applied to improve your ads' viewability: ensuring fast ad loads; lazy-loading advertisements; and redesigning a website to feature always-in-view units.

One issue has gotten surprisingly little discussion, though: ads are much more viewable on pages that people actually want to read. Take a look at the following figure, which was computed from a sample of a billion ad impressions during May 2014.

[Figure: ad viewability as a function of Engaged Time]

We see there’s a strong relationship between what fraction of ads are seen and how long a person spends reading the page: as Engaged Time increases from 15 seconds to one minute, viewability goes up by over half, from 37% to 57%. Visitors who read for more than 75 seconds see more than 60% of advertisements.
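
Under the hood, a figure like this is just a grouped aggregate: bucket impressions by the Engaged Time of the page view they appeared on, then take the fraction viewable in each bucket. Here’s a rough pandas sketch, with assumed column names rather than our actual schema:

```python
import pandas as pd

def viewability_by_engagement(impressions: pd.DataFrame) -> pd.Series:
    """Fraction of ad impressions marked viewable, bucketed by the
    Engaged Time (seconds) of the page view they appeared on.

    Expects columns 'engaged_time' (seconds) and 'viewable' (bool);
    the names are illustrative.
    """
    buckets = pd.cut(
        impressions["engaged_time"],
        bins=[0, 15, 30, 45, 60, 75, float("inf")],
    )
    return impressions.groupby(buckets, observed=True)["viewable"].mean()
```

Each value in the result is the viewability rate for one Engaged Time band, which is the kind of number the curve above plots.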

This isn’t too surprising. Of course people who read pages more deeply see more of the ads on the page, but it’s still worth taking note. We’ve argued for years that articles with higher average Engaged Time should be promoted because they represent the articles your audience is most interested in, and—now that viewability is more critical than ever—promoting your most deeply read articles makes good business sense, too.


Want more? Download the Chartbeat Quarterly.