The story goes like this:
Sometime around 1935, the eminent statistician Ronald A. Fisher met a lady. The lady in question had a seemingly outrageous claim: She could tell, simply by taste, whether milk or tea was added to a cup first. But Fisher was skeptical, and, being the eminent statistician that he was, he developed a method by which to test her claim. He would present her with a series of eight cups; in half of those, milk would be placed in the cup first, and in the other half, tea would be placed in the cup first. Fisher would then hand the cups to the lady in random order and ask her which liquid was added first. If she performed better than chance, he'd believe her claim.
Fisher’s description of this account was one of the first applications of hypothesis testing, perhaps the most widely used—and arguably one of the most important—statistical concepts of the modern era. That it was focused on a simple two-outcome choice is not surprising. Many processes we run into every day can be modeled in this way, as a coin flip. Will the subway show up on time? Will someone click the link to this article? Will the Detroit Tigers or the Kansas City Royals win this game? 1
These kinds of problems, those in which you only have two outcomes, are known in statistics as Bernoulli processes. The main parameter governing these phenomena is the probability that a trial succeeds. In Fisher’s example, this is the probability that the lady correctly identifies whether milk or tea is added first. For web traffic, this is the probability of clicking a link. In many of these types of two-outcome problems, you want to know how likely it is that you’ll observe some number of successes in a given number of trials. For example, you may be interested in the probability that 50 people will click on a link if 100 people see it. If you make an assumption that each event (i.e., each click) is independent of the previous event, the probability that you see some number of successes can be described by the binomial distribution. With a firm understanding of Bernoulli processes and the binomial distribution, you are equipped for modeling a whole host of binary-outcome problems.
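To make the 50-clicks-out-of-100 question concrete, here is a minimal sketch of the binomial calculation. The 50% click-through probability is purely an assumption for illustration, and `binomial_pmf` is a helper name I made up.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials,
    each succeeding with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Hypothetical link with an assumed 50% click-through probability
prob_exactly_50 = binomial_pmf(50, 100, 0.5)
prob_at_least_50 = sum(binomial_pmf(k, 100, 0.5) for k in range(50, 101))

print(f"P(exactly 50 of 100 click)  = {prob_exactly_50:.4f}")
print(f"P(at least 50 of 100 click) = {prob_at_least_50:.4f}")
```

Even with a perfectly "fair" link, seeing exactly 50 clicks is fairly unlikely (roughly 8%), because the probability is spread over many nearby counts.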
Is this a fair coin?
A binomial distribution, however, isn’t incredibly useful if we don’t know the probability of success for a single trial. Honestly, this is what we’re typically interested in finding out, and it is really what Fisher tested: He assumed the probability of a lady guessing whether milk or tea was added first was pure chance (50/50), and developed a test to see if the results of the experiment were consistent with that assumption. But, in general, how do we determine what this probability is?
There are two ways we can estimate the probability from a set of trials. We could simply count the number of successes we’ve had and divide by the total number of trials. For instance, if we flipped a coin 10 times and it came up heads 3 of those times, we might guess that the coin is an unfair coin, landing on heads only 30% of the time. This is all well and good, but we only flipped the coin 10 times. How certain are we that the probability is actually 0.3? Perhaps it truly is a fair coin and our sample size was just too small.
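A quick sketch of why 10 flips isn't much to go on: the snippet below computes the simple count-and-divide estimate, then asks how often a genuinely fair coin would give 3 or fewer heads in 10 flips. The variable names are my own.

```python
from math import comb

flips, heads = 10, 3

# The count-and-divide estimate of the probability of heads
estimate = heads / flips

# Chance that a fair coin (p = 0.5) produces `heads` or fewer heads in `flips` flips
p_at_most = sum(comb(flips, k) for k in range(heads + 1)) / 2**flips

print(f"Estimated probability of heads: {estimate}")
print(f"P(3 or fewer heads | fair coin): {p_at_most:.3f}")
```

A fair coin does this about 17% of the time, so 3 heads in 10 flips is weak evidence that the coin is unfair.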
Alternatively, we could assume that our probability of interest itself has some distribution. That is, perhaps we think that the probability is about 0.3, but we concede that it could be 0.1 or 0.5 or even 0.99999. Treating our parameter as a distribution is the heart of a technique known as Bayesian inference, which is based upon Bayes' rule:
Prob(quantity-of-interest | observation) = Prob(observation | quantity-of-interest) × Prob(quantity-of-interest) / Prob(observation)
Don’t be intimidated by this equation; it is actually fairly intuitive. The left-hand side represents the answer to the question: Given the data we observed, how certain are we that our quantity-of-interest takes on a given value? This is called the posterior distribution. The right-hand side contains information about what we believe about the process we’re interested in. Prob(quantity-of-interest) is known as the prior distribution. This describes our initial beliefs about the quantity we’re trying to find out about; in this case, our probability of success in the Bernoulli trial. Prob(observation | quantity-of-interest) is called the likelihood. The likelihood describes what we believe the distribution of the data to be if we assume our quantity is a specific value. In our click-through/coin-flipping example, this is simply the binomial distribution. If we know the fairness of the coin p, then the number of successes M out of N flips follows a binomial distribution with parameters N and p. Then, a simple multiplication of our prior and our likelihood gives us our posterior. 2
The above equation may not seem very impressive, but the real power of the Bayesian method comes in when we iteratively apply the equation to update our beliefs. That is, we can use a previously calculated posterior as a prior in a next round of calculation to update our posterior. If we do this enough times, we hope to decrease our uncertainty enough so that we can confidently determine what our “true” probability is. The neat thing is that if we choose our prior intelligently, we can get the math to work out so that updates are fairly easy.
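One common "intelligent" choice of prior for a binomial likelihood is a Beta distribution, because the posterior is then also a Beta distribution and updating is just addition. The post doesn't say which prior was actually used, so treat the Beta prior, the flat Beta(1, 1) starting point, and the visitor counts below as assumptions for illustration.

```python
from math import sqrt


def update_beta(alpha, beta, successes, failures):
    """Conjugate update: Beta prior + binomial data -> Beta posterior."""
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    return alpha / (alpha + beta)


def beta_std(alpha, beta):
    total = alpha + beta
    return sqrt(alpha * beta / (total**2 * (total + 1)))


# Assumed starting point: a flat Beta(1, 1) prior (every probability equally plausible)
prior = (1, 1)

# First batch: 3 clicks from 10 visitors
posterior = update_beta(*prior, successes=3, failures=7)

# The old posterior becomes the new prior: 27 clicks from 90 more visitors (made-up numbers)
posterior_2 = update_beta(*posterior, successes=27, failures=63)

# Updating once with all 100 visitors lands in exactly the same place
all_at_once = update_beta(*prior, successes=30, failures=70)

print(posterior, round(beta_mean(*posterior), 3), round(beta_std(*posterior), 3))
print(posterior_2, round(beta_mean(*posterior_2), 3), round(beta_std(*posterior_2), 3))
print(posterior_2 == all_at_once)
```

The last line is the point of iterative updating: feeding the data in batches or all at once gives the same posterior, as long as the underlying probability really is constant.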
That’s the math, but here is a concrete example. Consider an example website. Suppose we’re interested in the probability that a person will click on some link. If 10 visitors come to the page, and three of those people click on the link, we might guess that the click-through probability for that link is 3/10 = 0.3, but we wouldn’t be very certain; we only flipped the coin a small number of times. The far left panel on the figure below shows a prior we might build based on that uncertainty. It is peaked near 0.3, but is quite wide. 3
Now suppose that we’ve waited long enough for many, many visitors. The two subsequent panels show how the distribution evolves as we gather more data. When we’ve seen 1000 visitors, we are pretty darn certain that the click-through probability is somewhere very close to 0.3. Now imagine what happens when we’ve seen 10,000 visitors!
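To put rough numbers on how the distribution narrows, here is a sketch that assumes a flat Beta(1, 1) prior and that exactly 30% of visitors click at each sample size. Both are my assumptions, not necessarily what was used to draw the figure.

```python
from math import sqrt


def posterior_summary(clicks, visitors, prior_alpha=1.0, prior_beta=1.0):
    """Mean and standard deviation of the Beta posterior for a click-through probability."""
    alpha = prior_alpha + clicks
    beta = prior_beta + (visitors - clicks)
    total = alpha + beta
    mean = alpha / total
    std = sqrt(alpha * beta / (total**2 * (total + 1)))
    return mean, std


# Assume exactly 30% of visitors click at every sample size
summaries = {
    n: posterior_summary(clicks=3 * n // 10, visitors=n)
    for n in (10, 1000, 10000)
}

for n, (mean, std) in summaries.items():
    print(f"{n:>6} visitors: mean = {mean:.3f}, std = {std:.4f}")
```

The width shrinks roughly like one over the square root of the number of visitors, so going from 10 to 1,000 visitors narrows the distribution by about a factor of ten.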
The subtleties of assumption
Pretty amazing, right? If we gather data long enough, we can be incredibly certain about our click-through probability. In many cases, this is true. But let’s back up a bit.
In considering Bernoulli processes there is a fundamental underlying assumption that can often be overlooked. The assumption is this: The probability of success, p, is constant from trial to trial. For most phenomena, this is a reasonable assumption. But what if it is not? If the probability varies from trial to trial and this isn’t accounted for in our Bayesian updating, then we can end up becoming very certain about an incorrect probability. Consider the following example, where our probability varies smoothly between 0.3 and 0.6 over the course of 1,000 trials.
What happens when we do Bayesian updating with the same assumptions as above?
Not only does the peak of our posterior jump around wildly, depending on how many trials we do, but we start becoming incredibly certain that the probability is near the dead center of our varying probability function. I like to call this spurious certainty. We have an inaccurate model and too much data! We have become too certain in our beliefs.
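Spurious certainty is easy to reproduce in a small simulation. The sketch below assumes the probability follows one sinusoidal cycle between 0.3 and 0.6 over 1,000 trials and uses a flat Beta(1, 1) prior. The exact curve behind the figures isn't given, so both choices are stand-ins.

```python
import math
import random

random.seed(0)
n_trials = 1000


def true_p(t):
    # Assumed shape: one smooth cycle between 0.3 and 0.6 (hypothetical)
    return 0.45 + 0.15 * math.sin(2 * math.pi * t / n_trials)


# Beta(1, 1) prior, updated as if the probability never changed
alpha, beta = 1.0, 1.0
for t in range(n_trials):
    if random.random() < true_p(t):
        alpha += 1
    else:
        beta += 1

total = alpha + beta
post_mean = alpha / total
post_std = math.sqrt(alpha * beta / (total**2 * (total + 1)))

print(f"Posterior mean: {post_mean:.3f}, posterior std: {post_std:.4f}")
print("True probability actually ranged from 0.30 to 0.60")
```

The posterior ends up a couple of hundredths wide even though the true probability swings across a range of 0.3. The model is confident, and the model is wrong.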
This may seem like a contrived case, but in actuality, it is not. In fact, we’ve seen data here at Chartbeat to suggest that the probability to click on a link is time dependent. Take the figure below, which shows the average click probability for all links on an anonymous site’s homepage on a particular day.
The probability shows a 70% decrease from the beginning of the day to around 2 p.m., and then climbs back up. In order to accurately depict the click-through behavior of this site's users, we have to take this variation into account to avoid spurious certainty.
Using the data as a guide
How exactly can we take into consideration this time variation? We could add time directly into our Bayesian updates, but to get good data we might have to wait a long time. After all, in the general case we don’t really know what this variation looks like. Does our probability vary by time of day? Day of week? Month? Year? All of these? In reality, we probably don’t have enough time to gather enough data for our Bayesian updating to be very informative.
An alternative way is to forget about doing any sort of modeling and simply use measurements. In this method, we forget about priors and posteriors and likelihoods and just make a histogram of the data we’ve measured. We could, in effect, build an empirical form of the distributions from the figures above. Then we can update our beliefs by averaging the histogram of old data with the histogram of new data; we can even use a weighted average so anomalies will get “smoothed out.” We may not get a nice thin distribution, but at least we capture some of this temporal variation and we avoid spurious certainty. In fact, we've built our Heads Up Display, which measures click-through probabilities, to do exactly this.
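Here is a minimal sketch of the histogram-averaging idea. This is not the Heads Up Display's actual implementation; the bin count, the 0.3 weight, and the measured rates are all made up for illustration.

```python
def histogram(values, n_bins=10):
    """Normalized histogram of values that lie in [0, 1]."""
    if not values:
        raise ValueError("need at least one measurement")
    counts = [0] * n_bins
    for v in values:
        idx = min(int(v * n_bins), n_bins - 1)  # put v == 1.0 in the last bin
        counts[idx] += 1
    return [c / len(values) for c in counts]


def blend(old_hist, new_hist, weight_new=0.3):
    """Weighted average of two histograms; a small weight_new smooths out anomalies."""
    return [
        (1 - weight_new) * old + weight_new * new
        for old, new in zip(old_hist, new_hist)
    ]


# Hypothetical measured click-through rates from two time windows
old_rates = [0.28, 0.31, 0.33, 0.30]
new_rates = [0.12, 0.15, 0.11, 0.14]

blended = blend(histogram(old_rates), histogram(new_rates))
print([round(b, 3) for b in blended])
```

Because both inputs sum to one and the weights sum to one, the blended histogram is still a valid distribution. It keeps some memory of the old behavior while shifting toward the new measurements.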
The Tao of Data
In my opinion, we—and by we I mean humanity—should be ever the skeptics. In data analysis, this is paramount. Like Fisher, we should question outrageous claims and find ways to test them. We should revisit old assumptions, test them, and revisit them again. The data will guide the way, but we should always beware of spurious certainty.
Or, you know, you could always just flip a coin.
1. The Tigers. Always bet on the Tigers.
2. Ignoring, of course, the fraction’s denominator, but that is a bit out of the scope of this post... which is math speak for laziness.
3. A note about how to read probability density functions if you are not familiar with them: Probability density functions (PDFs) are truly probability densities; that is, the area under the curve between two values on the x-axis gives the probability that our quantity-of-interest will be between those two points. That’s why the y-axis is so funny. To get the probability, we essentially need to multiply the y-axis value by the distance between two values on the x-axis. If that doesn’t make any sense, just know that the best way to think about these distributions is to see where the curve is the most dense, that is, where the greatest area under the curve is. So, the places where the peaks exist are the most probable values. I’ll blog more about distributions in the near future.
After a month of exciting matches, the Attention Web World Cup has come to a close. In a time-honored tradition (pun intended) Ghana defeated the US with a score of 30 to 25. Congratulations to everyone from Ghana who was consuming content on the web during World Cup matches; you all contributed to this amazing achievement! And to my fellow Americans: next time around, let’s spend more time reading, okay?
To wrap up the festivities, one of our designers made these awesome animations of the time course of each tournament game based on the data I pulled. These plots show the median Engaged Time for users from each country as each match progresses.
When you view these animations, you’ll likely notice that some of these countries have incredibly stable Engaged Times while others have Engaged Times that are incredibly erratic. The U.S., for instance, shows a very small amount of variance in median Engaged Time, while Cote d’Ivoire and Cameroon have median Engaged Times that jump all over the place.
This behavior is a consequence of sample size. At any particular time during a match, users from many of the African countries and other smaller countries made up a much smaller sample than, say, users from the US or Australia. In statistics and data analysis, we’re always concerned about sample size for exactly the reason illustrated in many of these graphs. The variability in the sampled statistic can mask the “true” value. We can try to capture this with a distribution, but if the width of that distribution is large, then we can’t be very confident in the value of whatever measure of central tendency we choose (mean, median, mode, etc.). And the variance of an estimate such as the sample mean or median scales with the inverse of the sample size, so only as the number of points we’ve sampled gets large do we have a hope that the confidence in our estimate will rise.
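A small simulation shows the effect. It assumes Engaged Times follow an exponential distribution with a 30-second mean, which is a stand-in I picked for a skewed distribution and not a claim about the real data, and compares how much the sample median wobbles for a small country versus a large one.

```python
import random
from statistics import median, stdev

rng = random.Random(42)


def median_spread(sample_size, n_repeats=200, mean_seconds=30.0):
    """Standard deviation of the sample median across repeated samples."""
    medians = [
        median([rng.expovariate(1 / mean_seconds) for _ in range(sample_size)])
        for _ in range(n_repeats)
    ]
    return stdev(medians)


spread_small = median_spread(20)    # a sparsely sampled team
spread_large = median_spread(2000)  # a heavily sampled team

print(f"Spread of the median with   20 users: {spread_small:.2f} s")
print(f"Spread of the median with 2000 users: {spread_large:.2f} s")
```

With 100 times more users, the median's spread drops by roughly a factor of ten, which is the inverse-sample-size scaling of the variance showing up as a square root in the standard deviation.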
I’m actually quite surprised the U.S. made it so far in my scoring scheme here. I knew going into the #AWWC that some countries were sorely underrepresented in our sample. I expected a fair chance that these countries would show a falsely high median Engaged Time. If enough of the small sample of users just so happened to be long-engagement users, this would skew their results. In the Group Round this was okay, because I performed a statistical test that tried to account for this variability. There, I asked a very common statistical question: Assuming these two teams actually have the same median Engaged Time, what is the probability that I’d observe a difference in medians at least as extreme as the one I’ve observed? If that probability was low enough, then I declared Team A and Team B to have different medians, and took the higher one as the winner. But in the bracket round, we needed clear winners (no draws were allowed), so we left it up to sampling variance. For the small-sample-size teams, this was a double-edged sword. They only needed a few users spending an inordinate time engaged with content to edge above the higher-sample-size teams. But, conversely, if the users they had spent very short times, that would skew towards losing. We can see, though, that this seemed to work out well for these countries: they made a great showing all the way through the AWWC.
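One way to answer that Group Round question is a permutation test, sketched below. The post doesn't say which test was actually used, so this is a substitute technique, and the Engaged Times are made-up numbers.

```python
import random
from statistics import median


def median_permutation_test(a, b, n_permutations=2000, seed=0):
    """Estimate P(|median difference| >= observed) if both teams share one distribution."""
    rng = random.Random(seed)
    observed = abs(median(a) - median(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign users to teams at random
        diff = abs(median(pooled[: len(a)]) - median(pooled[len(a):]))
        if diff >= observed:
            extreme += 1
    # Add one to numerator and denominator so the estimate is never exactly zero
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical Engaged Times in seconds
team_a = list(range(10, 40))
team_b = list(range(50, 80))
team_c = [10, 20, 30, 40, 50] * 4

p_different = median_permutation_test(team_a, team_b)
p_same = median_permutation_test(team_c, team_c)

print(f"Clearly separated teams: p = {p_different:.4f}")
print(f"Identical teams:         p = {p_same:.4f}")
```

A low value is the "declare a winner" case; a high value is the draw.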
Thinking about variability is my job, so I might be biased here (yes, a statistics pun), but I hope you enjoyed this fun exploration of our data. I hope it got you thinking about international variability in engagement, and variability of metrics in general. Tweet me @dpvalente or email me at dan@chartbeat if you want to continue the discussion.
After two weeks of intense international web engagement, our bracket for the Attention Web World Cup is set. Many of the groups came down to the very last game, and if you’ve been following along, you witnessed the excitement of Honduras narrowly edging out Switzerland by one second and the USA keeping their two-second halftime lead to defeat Germany and advance to the knockout stage of the Attention Web World Cup.
The first round looks to have some really exciting matchups, if the scores from the Group Round are any indicator. Nigeria, however, appears to be the clear favorite going into the next round.
There will be a few small rule changes for the Attention Web World Cup from here on out. First, there will be no draws allowed, so we’re throwing statistical significance out the door and determining the winner only by the team with the highest median Engaged Time. Second, although the teams in the #AWWC are different than in the real World Cup, each match will still be played at the same time as the corresponding match.
Keep on checking back for bracket updates and blog posts. Boa Sorte e divirta-se!
If you're anything like me, you're still in a zombie-like stupor over the last 30 seconds of Sunday night's USA-Portugal game. It's astounding how drastically hopes, dreams, and expectations can change in a mere half minute.
Wait, we're talking about the Attention Web World Cup, right? Well, in that game, the U.S. was triumphant in defeating Portugal 25 - 22, knocking Portugal out of the next round.
For those of you anxiously anticipating which teams are moving on in the AWWC, or how your favorite country is faring, you can see all match results and standings in this sheet.
If you look at the results of each match, you'll likely notice a few interesting points. First, engagement for any particular country seems fairly consistent, although we only have two sample points per team to really make this a solid statement. More interestingly, the vast majority of countries have had very similar engagement times during their respective World Cup games—the exception being Nigeria, who has been totally smoking their opponents. (Their time differential through two matches is a whopping +16.8 seconds.)
Going into the AWWC, I wasn’t entirely sure if we’d see any drastic country-by-country differences in engagement. This is why I thought the Attention Web World Cup would be pretty interesting to play out. Any team could win any given match. We’d be at the whim of sampling variance, and we’d be able to hopefully gain some insight into international variability of Engaged Times.
There may not be any large noticeable differences across countries, and there are likely a lot of not-very-interesting reasons for this. In fact, I’m not entirely sure this is a good question to ask. I mean, why would there be drastic differences, right? But in discussions with fellow Data Science team members here at Chartbeat, we wondered whether any differences in how users from different countries engage with content might come from the language of the website rather than the country from which the user accesses a site. After all, we live in a global society, and perhaps people tend to gravitate towards content written in their primary language.
To get a feel for this, I looked at the Engaged Times I gathered for all games so far (as of Monday morning), but broken out by the language of the website. Below is what these distributions look like. The data are sorted by median Engaged Time.
A note about this plot:
This is a type of statistical visualization known as a box plot. Disclaimer: I’m a big fan of box plots. They give us a feel for the distribution of data. The left end of each box represents the 25th percentile of the data, the right end of each box represents the 75th percentile, and the heavy black vertical line represents the median. For example, for Chinese language websites, the 25th percentile is 14 seconds, the median is 33 seconds, and the 75th percentile is 41 seconds. The thin lines at either end of the box, known as the whiskers, extend to the minimum and maximum data points measured.
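If you want to compute the numbers behind a box plot like this yourself, here is a minimal sketch using Python's standard library. The Engaged Times are invented, and the "inclusive" quantile method is just one of several conventions, so results can differ slightly from other tools.

```python
from statistics import quantiles

# Hypothetical Engaged Times in seconds for one language
engaged_times = [5, 14, 20, 33, 38, 41, 90]

# Quartiles: 25th percentile, median, 75th percentile
q1, med, q3 = quantiles(engaged_times, n=4, method="inclusive")

box_plot_stats = {
    "min (left whisker)": min(engaged_times),
    "25th percentile": q1,
    "median": med,
    "75th percentile": q3,
    "max (right whisker)": max(engaged_times),
}

for name, value in box_plot_stats.items():
    print(f"{name:>20}: {value}")
```

Note that with whiskers running all the way to the minimum and maximum, a single unusually long session (the 90 here) stretches the whisker without moving the box much.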
We do see slight differences in engagement by language, but they are not overwhelming. The distributions overlap quite a bit. That said, there is a 15-second difference between the largest median Engaged Time (Chinese language sites) and the smallest (Arabic language sites). Western European languages tend to have nearly identical engagement. If there is a pattern, content written in Eastern languages like Chinese and Japanese does appear to gain more engagement than content written in Western languages, but I'm no expert on languages, so I’m hesitant to conjecture further. And, let’s be honest here, I haven't done a rigorous analysis; it has so far just been a fun exploration.
On Friday, we’ll be announcing our lineups for the knock-out stage. As always, I’d love to hear your thoughts, so feel free to tweet them @dpvalente using the hashtag #AWWC.