
Regardless of your newsroom’s size or how many articles you publish every day, chances are you’ve got a Twitter account.

What’s more, you’ve likely tried, with greater or lesser success, to leverage the social network for the distribution and promotion of your content. But once your thought-provoking, 140-characters-or-less message is dispatched, what happens next? Will the time and effort you spent pitching your editor pay off? Will you draw in readers who will actively engage with the content? Will you manage to convince readers to explore additional articles? Could you even convince users to come back over and over again? Or is all that effort lost in the Twitterverse, drawing in a few readers who come and leave, never to be seen again?

These are just a few of the questions we’ve been trying to answer here at Chartbeat. But rather than placing all visitors who come from Twitter into a single class and making the assumption that they all behave the same way, we decided to take a deeper dive with an eye toward nuance. We examined the behavior of readers who come from tweets published by content owners (first parties) versus those coming from independent agents (third parties).

To test our assumption and measure different forms of engagement, we decided on four metrics: average Engaged Time; average number of other pages a viewer visits on a site within two hours of their first visit; percent of users who return after initially visiting; and, of those users who return, the number of times on average they will return in the next 30 days. Roughly broken down, this gives us two metrics for short-term reader value (Engaged Time and subsequent pageviews) and two for long-term reader value (percent retained and rate of return).
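To make these definitions concrete, here is a minimal Python sketch of how the four metrics might be computed from a visit log. The visits.csv file, its column names, and the grouping by referrer class are assumptions made for this illustration, not Chartbeat’s actual pipeline.

```python
import pandas as pd

# Hypothetical visit log, one row per pageview:
#   user_id, visit_time, engaged_seconds, referrer_class
# (e.g. "first_party_tweet", "third_party_tweet", "other").
# The file name and schema are assumptions for this sketch.
visits = pd.read_csv("visits.csv", parse_dates=["visit_time"])

def reader_metrics(df):
    """Rough versions of the four metrics described above."""
    first_time = df.groupby("user_id")["visit_time"].min().rename("first_time")
    df = df.join(first_time, on="user_id")

    # Short-term value: Engaged Time, plus other pages read within two hours
    # of arrival (subtract one to exclude the initial pageview).
    avg_engaged = df["engaged_seconds"].mean()
    within_2h = df[df["visit_time"] <= df["first_time"] + pd.Timedelta(hours=2)]
    other_pages_within_2h = within_2h.groupby("user_id").size().sub(1).mean()

    # Long-term value: who returns at all, and how often, within 30 days.
    later = df[(df["visit_time"] > df["first_time"]) &
               (df["visit_time"] <= df["first_time"] + pd.Timedelta(days=30))]
    pct_returned = later["user_id"].nunique() / df["user_id"].nunique()
    returns_per_returner = later.groupby("user_id").size().mean()

    return avg_engaged, other_pages_within_2h, pct_returned, returns_per_returner

# Compare first-party and third-party Twitter readers side by side.
for referrer_class, group in visits.groupby("referrer_class"):
    print(referrer_class, reader_metrics(group))
```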

From previous experience and our own assumptions, it seemed to us that readers coming directly from a content owner’s tweet would probably already be members of that publisher’s loyal audience. It would therefore seem logical that these users show qualities similar to those of loyal readers—chiefly that they exhibit a higher-than-average return rate and read more pages when visiting.

So it wasn’t surprising that, when we looked at the percent of users who came back, readers from first-party sources showed return rates about 15% higher than readers from third parties. During their initial visit, readers coming from Twitter also tend to stick around longer, with first-party consumers reading on average three pages per visit, compared to non-social traffic’s one page. On this measure, however, first-party and third-party social consumers do not differ significantly from each other.


The number of times returning users came back, however, was surprising. Of those users who came back at all, readers who arrived via a first party returned on average 8 to 10 times, while similar readers who arrived via a third party came back 11 to 13 times. This may suggest that once you pass the retention barrier and convince a third-party reader to come back at all, those readers turn out to be even more valuable: they return more often and help bolster your current loyal population.


When looking at the time a reader engaged with a page, we found that readers from third parties actually engaged with content significantly longer than first-party readers. While first-party consumers engaged with content about as long as any other user, regardless of where they came from (averaging between 37 and 39 seconds), third-party readers engaged for between 42 and 45 seconds on average (significant at p < 0.01). These differences may look small, but in practice they can amount to anywhere from a few seconds to nearly a 40% difference in Engaged Time.


Though there are many reasons these differences may be occurring, one possible explanation lies in what attracts readers to engage with content. Users who follow publishers on Twitter are apt to know more about the publisher’s content and consequently have a greater sense of what type of content they want to read. Loyal readers, as opposed to new readers, may therefore be skimming through content, knowing they will come back later for follow-up stories or to learn more. Non-loyal readers, however, who are generally the readers coming from third-party tweets, arrive on the referral of a friend. These readers may engage deeply with content because of their personal connection to the person who recommended it.

So, to answer the initial questions of whether your tweets actually matter to the health of your site: Of course they do—you already knew that. Your tweets are vital to your loyal audience, and bring in readers who return more often and consume higher quantities of content than readers coming from anywhere else. Don’t forget about the importance of making content people want to tweet about, though! Because it turns out people actually listen to the recommendations of their friends, deeply engage in the content before them, and if enamored enough to return in the future, turn into fiercely loyal members of your site’s virtual population.

Note: This post was co-authored by Kris Harbold and Andy Chen.

By looking at the amount of time visitors are exposed to ads on different parts of a web page, we can get a sense of how much value your ad inventory generates for advertisers. And the results might not be what you expect.

It turns out, traditional advertising heuristics about which parts of the page are most valuable are wrong. Let’s look at figures 1 and 2, which show us the relationship between where the middle of an ad is positioned and how it’s viewed. We look at how ads are viewed through two metrics: viewability and average active exposure time. An ad is viewable if at least 50% of the area is in view for at least one continuous second. Of those visitors that have the opportunity to view the ad, average active exposure time is how many seconds they’ve spent actively browsing the page while that ad is in view.
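As a concrete (and simplified) reading of those two definitions, here is a small sketch. The per-second ping structure below is an assumption for illustration, not Chartbeat’s actual instrumentation, and it approximates “one continuous second” as one whole one-second ping.

```python
from typing import List, NamedTuple

class AdPing(NamedTuple):
    """One record per second for one visitor: how much of the ad is in the
    viewport, and whether the visitor is actively engaged. This structure
    is assumed for the sketch."""
    visible_fraction: float  # 0.0 to 1.0
    engaged: bool

def is_viewable(pings: List[AdPing]) -> bool:
    # Viewable: at least 50% of the ad's area in view for at least one
    # continuous second (approximated here as one whole one-second ping).
    return any(p.visible_fraction >= 0.5 for p in pings)

def active_exposure_seconds(pings: List[AdPing]) -> int:
    # Seconds spent actively browsing the page while the ad is in view.
    return sum(1 for p in pings if p.visible_fraction >= 0.5 and p.engaged)

def avg_active_exposure(visitors: List[List[AdPing]]) -> float:
    # Average over visitors who had the opportunity to view the ad,
    # i.e. those for whom the ad was viewable at all.
    viewable = [v for v in visitors if is_viewable(v)]
    return sum(active_exposure_seconds(v) for v in viewable) / len(viewable)
```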



As we might expect, both viewability and average active exposure time eventually trend downward. But this broad downward trend isn’t without a few blips. In particular, let’s consider the conspicuous dip in both viewability and average active exposure in the topmost 500 pixels of the page. Notice that viewability and average active exposure drop to a level that isn’t reached until 1,500 and 2,500 pixels down, respectively. This can be explained by the fact that visitors tend to start scrolling down the page right after they arrive, sometimes even before the page fully loads. Then, visitors settle on engaging content further down the page, which results in the subsequent increase in our metrics.

So what does this mean for your inventory? We know that the value of an ad position depends not only on its viewability but also on average active exposure. It follows that the most valuable ad real estate isn’t at the very top of the page, but rather just below the top, where both of our metrics peak. And while both metrics decrease below the fold, we see that average active exposure is remarkably resilient: a visitor who sees an ad 2,500 pixels below the top will still spend an average of 10 seconds of engaged time with the ad in view. That’s plenty of time for the ad to have an effect on the viewer.


When we look at data from across the Chartbeat Universe, one thing that stands out is the difference in how people from around the world engage with content—how long they spend reading, how far they scroll down the page, and which devices they use. Here’s a broad look at how visitors from different regions of the world spend their time and give their attention.


Many publishers would likely argue that the design of the website is as important for enticing readers to engage with the content as the content itself—humans, unfortunately, do judge books by their covers. The Guardian, The Atlantic, and The Wall Street Journal are just a few of the many publishers who have redesigned their websites this year.

We wondered if we could use our data to give insight into just how important web design is—a concept we call “data-driven web design.” Are there aspects of a page’s design that correlate to increased traffic, and even better, increased engagement?

Font sizes and colors, link sizes, link density, interaction, responsiveness: These are elements we can analyze for their ability to draw traffic to content and perhaps even contribute (along, of course, with the content itself) to keeping people there. Do people prefer to read articles surrounded by few links, large fonts, and bright colors? Or, are sparse, simple sites with undecorated text better? For those of us keen on data, could you use these attributes to predict how many people will be drawn to the content?

Understanding how page elements relate to click-throughs is by no means a new idea. For as long as Google AdSense has been around, there have been all kinds of smart people trying to figure out just how ad size relates to click-through rates (CTR). But ads and articles are very different beasts. Do the same rules that hold true for ads hold true for articles? Does link size matter? Is it the only thing that matters? Are there even any rules at all?

We here at Chartbeat like to focus on engagement, but as a first pass, we wanted to examine how the almighty click-through relates to the size and distribution of links on a homepage. We examined a measure of click-through probability: clicks per minute per active visitor (CPV). The data used in this analysis is the same data that powers one of our most popular products, the Heads Up Display.
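Spelled out, CPV is simply clicks per minute on a link, normalized by the number of concurrently active visitors. A minimal illustration follows; the function and its example numbers are ours, not the Heads Up Display’s actual computation.

```python
def clicks_per_active_visitor(clicks: int, minutes: float, active_visitors: float) -> float:
    """CPV: clicks per minute on a link, divided by the average number of
    concurrently active visitors over the same window."""
    return (clicks / minutes) / active_visitors

# Example: 12 clicks on a link over 5 minutes with ~300 active visitors
# gives a CPV of 0.008 clicks per minute per active visitor.
print(clicks_per_active_visitor(12, 5, 300))
```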

We looked at data from 294 publishing sites at several different times of day across several days to sample a variety of conditions. Much of what we found is not surprising—that is, people click where the design guides them to click. For instance, the majority of clicks happen at page depths of 400 to 600 pixels, where most main content links are located (Figure 1). The other most probable places for clicks are the locations of menus on the left and right sides of the page. Nothing surprising here. As far as link sizes go, intuition holds as well: One would expect larger links—which likely represent headline articles—to drive greater traffic. This is certainly true. As a link’s area grows, generally so does the clicks per active visitor (Figure 2).



Larger links correlate with higher click-throughs, but what about link density? For sites with a lot of closely packed links, does the crowding dilute click-through rates? After all, there are only so many concurrent users to split across content. As a proxy for density, we looked at the median distance between links on a site. The data show that CPV decreases approximately linearly as the distance between links grows from about 450 pixels to about 2,000 pixels. Sites with more closely spaced links perform about two and a half times better than sites with distant links. It seems users prefer denser sites (Figure 3).
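For reference, here is one plausible way to compute that density proxy. We read “median distance between links” as the median pairwise distance between link centers, which is an assumption on our part, and the coordinates in the example are made up.

```python
import math
from itertools import combinations
from statistics import median

def median_link_distance(link_centers):
    """Link-density proxy: median pairwise distance, in pixels, between
    link centers on a page. Smaller values mean a denser page."""
    distances = [math.dist(a, b) for a, b in combinations(link_centers, 2)]
    return median(distances)

# Example: three links stacked vertically, 500 px apart -> median 500.0 px.
print(median_link_distance([(400, 500), (400, 1000), (400, 1500)]))
```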


These two pieces of evidence seem to contradict each other, though, because the distance between large links is necessarily large (assuming, of course, the links aren’t nested!). You might think, “Wait… if I have a lot of large links, I’ll have huge CPV, but they will be spaced far apart, so I’ll have a small CPV!” But, in reality, the data is only reflecting a common website design principle—a few large links interspersed with many smaller, closely spaced links.

In fact, if you ponder these data long enough, it seems that we run into a chicken-and-egg problem. Click-throughs force a tautology. Design forces people to click in certain places, so they do. And we measure this. See why engagement matters?

In any case, the data back up our intuition when it comes to determining how many people will click through to a given piece of content. Given a large enough dataset in which you know where a link is on a page, its height and width, how many people are on the page, and how many are currently engaged with content, you could likely obtain a reasonable prediction for the CPV. And perhaps such a model could then be used to guide the redesign of a website.
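As a rough sketch of what such a model could look like, here is a plain linear regression on hypothetical link features. The CSV file, column names, and choice of model are illustrative assumptions, not the model described in the next paragraph.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical training table: one row per link, with its page position,
# dimensions, audience counts, and observed CPV. Column names are assumed.
links = pd.read_csv("link_features.csv")
features = ["x", "y", "width", "height", "concurrents", "engaged_visitors"]

X_train, X_test, y_train, y_test = train_test_split(
    links[features], links["cpv"], test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)

# How often is the prediction within 0.007 clicks/min/active visitor?
errors = (model.predict(X_test) - y_test).abs()
print((errors < 0.007).mean())
```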

We decided to try this (not the site redesign part, the modeling part!). Simple statistical models we have recently built can predict CPV for a link to within 0.007 clicks per minute per active visitor for 92% of links. That might sound impressive, but for context, only four websites in the set we analyzed have a median CPV greater than this. There is much more work to do before we can really answer whether design can predict attraction to and engagement with content, but the way forward is promising. Colors, font sizes, responsiveness—the design space is large. These can draw people in, but ultimately, it is the content that will keep people there.

So, the next time you are thinking of undergoing an overhaul or redesign, stare closely at your Heads Up Display. Think about link size, link density, and ask yourself what you can do to draw people into that fabulous content.

Over the past few years, Internet traffic has seen major changes. As smartphones become more ubiquitous, more and more people are spending a significant amount of time on the web on mobile devices, and in particular, via mobile applications. In October, more than half of the time Internet users spent online was via mobile and tablet applications.

With the rise in mobile application traffic, there has been a parallel increase in unattributed traffic to articles on the web—a bucket of traffic referred to as dark social. This category of traffic encompasses not only the visitors who enter a URL directly, but also those who click on links from email, instant messaging, and many mobile and desktop applications. Unattributed traffic can also result from a number of technical issues that cause referrer information to be omitted from a known traffic source. The lack of clear attribution for this traffic is a big problem: for most domains on our network, dark social accounts for 20% to 40% of overall external traffic to articles. Because of the popularity of mobile applications, the percentage of dark social traffic among mobile users is even higher.

Fortunately, the problem of dark social is becoming more widely acknowledged throughout the industry. Individual domains have long tried to manually alleviate the problem by including tracking tags and custom URLs on their social content, but are increasingly looking for additional tools to confront the problem head on. Analytics providers continue to refine their offerings and take a leading role in driving the conversation. Major referrer sources are doing more to ensure that their traffic is properly acknowledged. We’ll take a look at some of these developments.

One way of getting a handle on this attribution problem is to look carefully at traffic patterns among the articles on your site. For a large majority of the articles we have looked at, dark social traffic closely correlates in time with other attribution sources. For instance, several of the most popular mobile applications for Reddit do not pass referrer information. Consequently, when we see spikes in Reddit-based traffic on desktop, we tend to see a corresponding spike of dark social traffic on mobile. This suggests that a large portion of dark social traffic is really just misattribution of known referrers. As a result, for individual articles, you can explicitly attribute much of this traffic to the correct sources.
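A rough sketch of that kind of check: bucket an article’s pageviews by hour and correlate the “dark” mobile series with the Reddit desktop series. The log format, column names, and hourly resolution are assumptions for this example.

```python
import pandas as pd

# Hypothetical per-pageview log for a single article:
#   timestamp, device ("mobile" / "desktop"), referrer_class ("reddit", "dark", ...)
views = pd.read_csv("article_views.csv", parse_dates=["timestamp"])

hourly = (views
          .assign(hour=views["timestamp"].dt.floor("h"))
          .groupby(["hour", "device", "referrer_class"])
          .size()
          .unstack(["device", "referrer_class"], fill_value=0))

# A high correlation between desktop Reddit traffic and "dark" mobile traffic
# suggests much of the latter is really misattributed Reddit traffic.
print(hourly[("desktop", "reddit")].corr(hourly[("mobile", "dark")]))
```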

Chartbeat is now leveraging user agent profiles to disambiguate a significant chunk of dark social mobile application traffic. Many major mobile applications such as Facebook, Twitter, Pinterest, Drudge Report, and Flipboard set a tag in the user agent to identify the application. For example, in the following user agent, the tag “[FBAN/FBIOS…]” identifies the use of the Facebook application on iOS:

Mozilla/5.0 (iPhone; CPU iPhone OS 8_1_2 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Mobile/12B440 [FBAN/FBIOS;FBAV/;FBBV/6017145;FBDV/iPhone7,2;FBMD/iPhone;FBSN/iPhone OS;FBSV/8.1.2;FBSS/2;FBCR/AT&T;FBID/phone;FBLC/en_US;FBOP/5]
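As an illustration of how such tags can be used, here is a small sketch. The mapping below contains only the Facebook iOS marker from the example above; Twitter, Pinterest, Drudge Report, and Flipboard set their own markers, which we leave out, and this is not Chartbeat’s actual rule set.

```python
# User agent markers that identify in-app browsers. Only the Facebook iOS
# marker from the example above is listed; this mapping is illustrative.
UA_TAGS = {
    "FBAN/FBIOS": "Facebook app (iOS)",
}

def attribute_visit(referrer: str, user_agent: str) -> str:
    """If a visit arrives without a referrer, try to attribute it from a
    user agent tag before falling back to dark social."""
    if referrer:
        return referrer                  # normal referrer-based attribution
    for marker, source in UA_TAGS.items():
        if marker in user_agent:
            return source                # reclaimed from dark social
    return "dark social"                 # still unattributed

ua = ("Mozilla/5.0 (iPhone; CPU iPhone OS 8_1_2 like Mac OS X) "
      "AppleWebKit/600.1.4 (KHTML, like Gecko) Mobile/12B440 [FBAN/FBIOS;FBSV/8.1.2]")
print(attribute_visit("", ua))  # -> Facebook app (iOS)
```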

In many cases, we saw an immediate difference after Chartbeat started capturing missing referrers for these user agent-tagged mobile applications. For instance, we saw the traffic attributed to mobile Facebook use jump as much as 40% from previously misattributed dark social traffic.

Several large sites have also made recent efforts to try to pass along referrer information more of the time. In early 2014, Yahoo made a sitewide conversion to use HTTPS instead of HTTP by default, causing referrer data to be dropped. Recently, however, we have observed changes from the Yahoo site that now allow the referrer to be passed for both Yahoo Search and Yahoo News. Facebook also recently announced that it fixed a bug that was causing referrer data to get lost on outgoing mobile application clicks. This fix is particularly notable because of how much traffic originates from the social network.

We can see the results of these changes across our network. Figure 1 shows how the share of dark social traffic has evolved over the second half of 2014. While dark social on desktop is relatively stable, we can see a significant drop in dark social for both mobile and tablet devices in November, concurrent with the Facebook fix. (We also see a corresponding rise in Facebook traffic.)


As more sites pay closer attention to the analytics needs of their publishers and as more mobile applications pass referrer information or user agent identification, perhaps we can make further inroads into the problem of missing attribution. Still, even with the most recent efforts, dark social’s share remains at about a third of external traffic, and we still see close time-series correlations for major drivers of traffic such as Facebook and Reddit. It is apparent that we’ve made strong progress in mitigating dark social traffic on mobile and tablet devices, but as a share of traffic, dark social on mobile is still significantly higher than on desktop. Unfortunately, we can’t give up on tracking codes and custom URLs quite yet.