Many publishers would likely argue that the design of the website is as important for enticing readers to engage with the content as the content itself—humans, unfortunately, do judge books by their covers. The Guardian, The Atlantic, and The Wall Street Journal are just a few of the many publishers who have redesigned their websites this year.We wondered if we could use our data to give insight into just how important web design is—a concept we call “data-driven web design.” Are there aspects of a page’s design that correlate to increased traffic, and even better, increased engagement?Font sizes and colors, link sizes, link density, interaction, responsiveness: These are elements we can analyze for their ability to draw traffic to content and perhaps even contribute (along, of course, with the content itself) to keeping people there. Do people prefer to read articles surrounded by few links, large fonts, and bright colors? Or, are sparse, simple sites with undecorated text better? For those of us keen on data, could you use these attributes to predict how many people will be drawn to the content?Understanding how page elements relate to click-throughs is by no means a new idea. For as long as Google AdSense has been around, there have been all kinds of smart people who’ve tried to figure out just how ad size relates to clickthrough-rates (CTR). But ads and articles are very different beasts. Do the same rules that hold true for ads hold true for articles? Does link size matter? Is it the only thing? Are there even any rules at all?We here at Chartbeat like to focus on engagement, but as a first-pass, we wanted to examine how the almighty click-through relates to the size and distribution of links on a homepage. We examined a measure of click-through probability, the clicks per minute per active visitor (CPV). The data used in this analysis is the same which powers one of our most popular products, our Heads Up Display.We looked at data from 294 publishing sites during several different times of day across several days to sample a variety of conditions. Much of what we found is not surprising—that is, people click where the design guides them to click. For instance, the majority of clicks happen at page depths of 400 to 600 pixels, where most main content links are located (Figure 1). The other most probable places for clicks are the locations of menus on left and right sides of the page. Nothing surprising here. As far as link sizes go, intuition holds as well: One would expect larger links—which likely represent headline articles—to drive greater traffic. This is certainly true. As a link’s area grows, generally so does the clicks per active visitor (Figure 2). Heads Up Display. Think about link size, link density, and ask yourself what you can do to draw people into that fabulous content.
A few months ago, I wrote about coin-flipping experiments, Bayesian updating, and spurious certainty. Much of the discussion in that post centered on using probability densities to reason about data, but I put off detailed discussion about densities until a future time (and I passed the buck in a footnote, no less!). Well, the time has come to right that injustice. Over at our Engineering Blog, I’ve written up a bit of a tutorial on distributions and some different ways to visualize them. So, if this is something you are curious and/or interested in, head over there to read it. Enjoy!
The story goes like this:Sometime around 1935, the eminent statistician Ronald A. Fisher met a lady. The lady in question had a seemingly outrageous claim: She could tell, simply by taste, whether milk or tea was added to a cup first. But Fisher was skeptical, and, being the eminent statistician that he was, he developed a method by which to test her claim. He would present her with a series of eight cups; in half of those, milk would be placed in the cup first, and in the other half, tea would be placed in the cup first. Fisher would then hand the cups to the lady in random order and ask her which liquid was added first. If she performed better than chance, he’d believe her claim.Fisher’s description of this account was one of the first applications of hypothesis testing, perhaps the most widely used—and arguably one of the most important—statistical concepts of the modern era. That it was focused on a simple two-outcome choice is not surprising. Many processes we run into every day can be modeled in this way, as a coin flip. Will the subway show up on time? Will someone click the link to this article? Will the Detroit Tigers or the Kansas City Royals win this game? 1These kind of problems—those in which you only have two outcomes—are known in statistics as Bernoulli processes. The main parameter governing these phenomena is the probability that a trial has succeeded. In Fisher’s example, this is the probability that the lady correctly identifies whether milk or tea is added first. For web traffic, this is the probability of clicking a link. In many of these types of two-outcome problems, you want to know how likely it is that you’ll observe some number of successes in a given number of trials. For example, you may be interested in the probability that 50 people will click on a link if 100 people see it. If you make an assumption that each event (i.e., each click) is independent of the previous event, the probability that you see some number of successes can be described by the binomial distribution. With a firm understanding of Bernoulli processes and the binomial distribution, you are equipped for modeling a whole host of binary-outcome problems.
Is this a fair coin?
A binomial distribution, however, isn’t incredibly useful if we don’t know the probability of success for a single trial. Honestly, this is what we’re typically interested in finding out, and it is really what Fisher tested: He assumed the probability of a lady guessing whether milk or tea was added first was pure chance (50/50), and developed a test to see if the data were consistent with the results of the experiment. But, in general, how do we determine what this probability is?There are two ways we can estimate the probability from a set of trials. We could simply count the number of successes we’ve had and divide by the total number of trials. For instance, if we flipped a coin 10 times and it came up heads 3 of those times, we might guess that the coin is an unfair coin, landing on its head only 30% of the time. This is all well and good, but we only flipped the coin 10 times. How certain are we that the probability is actually 0.3? Perhaps it truly is a fair coin and our sample size was just too small.Alternatively, we could assume that our probability of interest itself has some distribution. That is, perhaps we think that the probability is about 0.3, but we concede that it could be 0.1 or 0.5 or even 0.99999. Treating our parameter as a distribution is the heart of a technique known as Bayesian inference, which is based upon Bayes rule:
Don’t be intimidated by this equation—it is actually fairly intuitive. The left-hand side represents the answer to the question: Given the data we observed, how certain are we that our quantity-of-interest takes on a given value? This is called the posterior distribution. The right-hand side contains information about what we believe about the process we’re interested in. Prob(quantity-of-interest) is known as the prior distribution. This describes our initial beliefs about the quantity we’re trying to find out about; in this case, our probability of success in the Bernoulli trial. Prob(observation | quantity-of-interest) is called the likelihood. The likelihood describes what we believe the distribution of the data to be if we assume our quantity is a specific value. In our click-through/coin-flipping example, this is simply the binomial distribution. If we know the fairness of the coin p, then the probability we get M successes out of N flips follows a binomial distribution with parameters M and N. Then, a simple multiplication of our prior and our likelihood gives us our posterior. 2The above equation may not seem very impressive, but the real power of the Bayesian method comes in when we iteratively apply the equation to update our beliefs. That is, we can use a previously calculated posterior as a prior in a next round of calculation to update our posterior. If we do this enough times, we hope to decrease our uncertainty enough so that we can confidently determine what our “true” probability is. The neat thing is that if we choose our prior intelligently, we can get the math to work out so that updates are fairly easy.That’s the math, but here is a concrete example. Consider an example website. Suppose we’re interested in the probability that a person will click on some link. If 10 visitors come to the page, and three of those people click on the link, we might guess that the click-through probability for that link is 3 /10 = 0.3 , but we wouldn’t be very certain; we only flipped the coin a small number of times. The far left panel on the figure below shows a prior we might build based on that uncertainty. It is peaked near 0.3, but is quite wide. 3
The subtleties of assumption
Pretty amazing, right? If we gather data long enough, we can be incredibly certain about our click-through probability. In many cases, this is true. But let’s back up a bit.In considering Bernoulli processes there is a fundamental underlying assumption that can often be overlooked. The assumption is this: The probability of success, p, is constant from trial to trial. For most phenomena, this is a reasonable assumption. But what if it is not? If the probability varies from trial to trial and this isn’t accounted for in our Bayesian updating, then we can end up becoming very certain about an incorrect probability. Consider the following example, where our probability varies smoothly between 0.3 and 0.6 over the course of 1,000 trials.
Using the data as a guide
How exactly can we take into consideration this time variation? We could add time directly into our Bayesian updates, but to get good data we might have to wait a long time. After all, in the general case we don’t really know what this variation looks like. Does our probability vary by time of day? Day of week? Month? Year? All of these? In reality, we probably don’t have enough time to gather enough data for our Bayesian updating to be very informative.An alternative way is to forget about doing any sort of modeling and simply use measurements. In this method, we forget about priors and posteriors and likelihoods and just make a histogram of the data we’ve measured. We could, in effect, build an empirical form of the distributions from the figures above. Then we can update our beliefs by averaging the histogram of old data with the histogram of new data; we can even use a weighted average so anomalies will get “smoothed out.” We may not get a nice thin distribution, but at least we capture some of this temporal variation and we avoid spurious certainty. In fact, we’ve built our Heads Up Display, which measures click-through probabilities, to do exactly this.
The Tao of Data
In my opinion, we—and by we I mean humanity—should be ever the skeptics. In data analysis, this is paramount. Like Fisher, we should question outrageous claims and find ways to test them. We should revisit old assumptions, test them, and revisit them again. The data will guide the way, but we should always beware of spurious certainty.Or, you know, you could always just flip a coin.
1. The Tigers. Always bet on the Tigers.↩2. Ignoring, of course, the fraction’s denominator, but that is a bit out of the scope of this post… which is math speak for laziness.↩3. A note about how to read probability density functions if you are not familiar with them: Probability density functions (PDFs) are truly probability densities; that is, the area under the curve between two values on the x-axis gives the probability that our quantity-of-interest will be between those two points. That’s why the y-axis is so funny. To get the probability, we essentially need to multiply the y-axis value by distance between two values on the x-axis. If that doesn’t make any sense, just know that the best way to think about these distributions is to see where the curve is the most dense—where the greatest area under the curve is. So, the places where the peaks exist are the most probable values. I’ll blog more about distributions in the near future.↩
After a month of exciting matches, the Attention Web World Cup has come to a close. In a time-honored tradition (pun intended) Ghana defeated the US with a score of 30 to 25. Congratulations to everyone from Ghana who was consuming content on the web during World Cup matches; you all contributed to this amazing achievement! And to my fellow Americans: next time around, let’s spend more time reading, okay?To wrap up the festivities, one of our designers made these awesome animations of the time course of each tournament game based on the data I pulled. These plots show the median Engaged Time for users from each country as each match progresses.When you view these animations, you’ll likely notice that some of these countries have incredibly stable Engaged Times while others have Engaged Times that are incredibly erratic. The U.S., for instance shows a very small amount of variance in median Engaged Time, while Cote d’Ivoire and Cameroon have median Engaged Times that jump all over the place.This behavior is a consequence of sample size. At any particular time during a match, users from many of the African countries and other smaller countries were a much smaller sample size than, say, users from the US or Australia. In statistics and data analysis, we’re always concerned about sample size for exactly the reason illustrated in many of these graphs. The variability in the sampled statistic can mask the “true” value. We can try to capture this with a distribution, but if the width of that distribution is large, then we can’t be very confident in the value of whatever measure of central tendency we choose (mean, median, mode, etc.). And sample variance depends on the inverse of the sample size, so only as the number of points we’ve sampled gets large do we have a hope that the confidence in our estimate will rise.I’m actually quite surprised the U.S. made it so far in my scoring scheme here. I knew going into the #AWWC that some countries were sorely underrepresented in our sample. I expected a fair chance that these countries would show a falsely high median Engaged Time. If enough of the small sample of users just so happened to be long-engagement users, this would skew their results. In the Group Round this was okay, because I performed a statistical test that tried to account for this variability. There, I asked a very common statistical question: Assuming these two teams actually have the same median Engaged Time, what is the probability that I’d observe a difference in medians at least as extreme as the one I’ve observed? If that probability was low enough, then I declared Team A and Team B to have different medians, and took the higher one as the winner. But in the bracket round, we needed clear winners (no draws were allowed), so we left it up to sampling variance. For the small-sample-size teams, this was a double edged sword. They only needed a few users spending an inordinate time engaged with content to edge above the higher-sample-size teams. But, conversely, if the users they had spent very short times, that would skew towards losing. We can see, though, that this seemed to work out well for these counties—they made a great showing all the way through the AWWC.Thinking about variability is my job, so I might be biased here (yes, a statistics pun), but I hope you enjoyed this fun exploration of our data. I hope it got you thinking about international variability in engagement, and variability of metrics in general. Tweet me @dpvalente or email me at dan@chartbeat if you want to continue the discussion.