The State of Dark Social

Here at Chartbeat, we have a long history of trying to shed light on the sources of your traffic. Since 2012, we’ve helped illuminate the phenomenon known as dark social—where traffic is likely to come from social sources, yet lacks explicit referrer attribution. Two years later, Internet traffic looks a lot different than it previously did. Mobile and app traffic have grown significantly.

More sites are moving to HTTPS. Usage patterns are evolving. We wanted to take the opportunity to look into the current state of dark social and dive deeply not only into potential causes, but also potential disambiguations of this nebulous block of traffic. As a result of these investigations, we found a way to attribute a sizable chunk of dark social (up to half!) to application traffic.

So what exactly is dark social? Here’s a brief recap. Back in the Wild West of web analytics, we tagged any traffic coming in without a referrer field as direct traffic. Many people have attempted to exhaustively list what might cause an empty referrer field, but the typical explanation of a visitor typing in the URL directly was unsatisfying for article content. The alternative explanation that these visitors came from IMs, emails, or apps seemed much more likely, and so we categorized them as social instead. This social traffic came to be known as “dark social” and has made regular appearances at the top of referrer lists ever since.

Dark Social Volume

These days, dark social accounts for about a third of external traffic to sites across our network. The exact amount varies quite a bit depending on the particular site in question, but most sites have a chunk ranging from significant to extremely significant. The following graph shows a rough distribution of the percent of external traffic classified as dark social for a given domain for a sample of Chartbeat’s data, with the mean given in red:

We can break this data out further. The number is markedly higher on mobile, with upwards of 50% of mobile external traffic lacking a referrer on some sites. This is already a critical problem — how are we to analyze our top traffic drivers if we can’t attribute half of our traffic? — and since mobile’s share of traffic is increasing, it’s only going to get worse.
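To make the metric concrete, here is a minimal sketch of how a dark social share per device could be computed from raw visits. It is an illustration, not our production pipeline: the function names, the `(device, referrer)` record shape, and the sample data are all assumptions, and it treats every visit as landing on an article page.

```python
from collections import Counter
from urllib.parse import urlparse


def is_external(referrer, site_host):
    """Treat a visit as external if it has no referrer or comes from another host."""
    if not referrer:
        return True
    return urlparse(referrer).hostname != site_host


def dark_social_share(visits, site_host):
    """Return {device: fraction of external visits that arrived with no referrer}."""
    external = Counter()
    dark = Counter()
    for device, referrer in visits:
        if not is_external(referrer, site_host):
            continue  # internal navigation is not part of external traffic
        external[device] += 1
        if not referrer:
            dark[device] += 1
    return {device: dark[device] / external[device] for device in external}


# Made-up sample data, for illustration only.
visits = [
    ("desktop", "https://www.google.com/"),
    ("desktop", "https://example.com/home"),  # internal, ignored
    ("desktop", ""),
    ("desktop", "https://twitter.com/"),
    ("mobile", ""),
    ("mobile", "https://m.facebook.com/"),
]
shares = dark_social_share(visits, "example.com")
```

With this sample, one of three external desktop visits and one of two external mobile visits lack a referrer, so the mobile share comes out higher.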

Potential Sources of Dark Social

In order to get a handle on the drivers of the problem, we ran an empirical analysis of potential sources of dark social: we set up a site, posted links to it on various traffic sources, and clicked those links from desktop browsers, mobile browsers, and mobile apps. The goal was to determine which traffic sources can be reliably assumed not to be dark social (because they always set the referrer) and which do contribute to dark social (because they always or sometimes lack a referrer).

We were specifically interested in looking at some of the most popular social mobile apps. The following table shows whether some of the combinations of sites and modes of interaction successfully passed a referrer in our testing (with the caveat that we only tested the current versions of the applications and were not exhaustive using all different browsers and operating systems):

Referrer Passed?    Desktop    Mobile Browser    Mobile App
Facebook            Mostly     Yes               Sometimes
Twitter             Yes        Yes               Yes
Reddit              Yes        Yes               No
Tumblr              Yes        Yes               Yes
Gmail               No         No                No
IM/Text             No         No                No

We can see that major traffic sources are generally good about allowing their data to be tracked. However, there were some interesting exceptions.

  1. Facebook’s desktop site sometimes doesn’t set a referrer if the onclick listener is avoided (for instance, if you open a link in a new tab/window).
  2. Desktop and mobile traffic from Reddit.com sets a referrer, but the top apps for reading Reddit all do not set a referrer.
  3. It became clear in further analysis (see below) that the Facebook app only sometimes sets the referrer.

Beyond these notes, things generally worked as expected: email, IM, and most mobile apps were dark social; social networks and major sources of external traffic (even some using HTTPS, like Facebook and Google) were not.

Disambiguating Traffic with Time Series

The above findings raised more questions than they settled. If less well measured sources like email and IM drive a significant portion of traffic, do they at least correlate well with more explicitly measurable sources of traffic? For applications like Facebook and Reddit that do not always send referrer data, is there a way to identify their contributions within patterns of dark social traffic? We found that in many cases, the answers to these questions were a resounding yes.

For the next phase of our analysis, we wanted to take a look at the time series data for specific articles to try to identify patterns in the traffic. If a popular story were to break, you’d expect to see different responses in different traffic sources. For a site like Reddit, you might expect traffic to be tightly peaked and highly correlated with the story’s ranking on the home page. For a site like Facebook, the interest might fade out more gradually as it filters through different people’s feeds. You might expect instant messaging to yield a tighter, shorter-tailed traffic distribution than a medium like email. The following plot shows an interesting example of a story that illustrates some of these features. There was a distinct spike in Reddit-driven traffic that lasted all of four hours followed by a more prolonged pickup in Facebook traffic.

The most interesting observation here is how well dark social traffic correlates with the identifiable sources. In this example, you could be convinced that the dark social is really just misattributed traffic from Facebook and Reddit. Some evidence for this:

  • The residual traffic is almost non-existent, and in particular, the amount of internal and search traffic is negligible.
  • If a secondary social sharing mechanism like email or IM were driving a significant amount of traffic, we’d expect the dark social time series to lag behind the sharp spike in traffic.

We can further break these numbers down by examining the difference between mobile and desktop traffic. In the following graph, we zoom into the Reddit spike in traffic in the above article.

We can see a stark divergence in traffic patterns by device, which confirms some of our earlier findings. We have at least strong anecdotal evidence that a large portion of Reddit mobile traffic comes from apps and is being categorized as dark social.

We can examine the patterns of traffic for the Facebook-driven portion of the time series as well:

Here, the picture is not quite as cut-and-dried as before. Dark social comprises only a small percentage of overall desktop traffic, but commands a fairly significant chunk of mobile traffic. This pattern is typical across other articles. When we observe Facebook traffic, we can almost always find a corresponding amount of dark social traffic. The actual amount of dark social traffic relative to Facebook traffic can vary significantly by article and by site, but will generally be much higher on mobile devices. As Facebook is such a large driver of mobile traffic in general, this can help explain some of the difference we see between desktop dark social share and mobile dark social share.

Of course, it’s difficult to disambiguate where dark social is coming from at scale — it’s a mix of traffic from many referrers. But, for a large majority of stories, if we look at the top 10 referrers and correlate the time series of traffic that they send with dark social’s time series, we get some referrer that’s a very high match, which strongly suggests that that particular story is getting its dark social from that particular referrer.

This suggests that, while we can’t just flip a switch and disambiguate all traffic, a careful analysis of a particular story is likely to be able to turn up the source of the majority of its dark social. Of course, this won’t always work: there are still person-to-person shares (IM, email, etc.), shares on apps with no corresponding website, and so forth that account for a chunk of dark social. Still, if we look at correlations between dark social traffic and other traffic sources (a rudimentary and blunt tool to be sure), we see that fewer than 25% of stories have time series with less than 80% correlation, and many are much more highly correlated.
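The correlation check described above can be sketched roughly as follows. This is a simplified illustration that assumes plain Pearson correlation over equal-length, time-aligned traffic counts; the function names and the numbers are made up for the example and are not taken from our data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length, at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def best_matching_referrer(dark_social, referrer_series):
    """Return (referrer, correlation) for the series that best tracks dark social."""
    scores = {
        name: pearson(dark_social, series)
        for name, series in referrer_series.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical hourly visit counts for one story.
dark = [5, 40, 90, 30, 10, 6]
top_referrers = {
    "reddit.com": [10, 80, 180, 60, 20, 12],
    "facebook.com": [20, 22, 25, 30, 34, 36],
    "google.com": [15, 15, 14, 16, 15, 15],
}
best, score = best_matching_referrer(dark, top_referrers)
```

In this toy example the Reddit series has the same shape as the dark social series, so it is picked as the best match. A high correlation is only suggestive, since two sources can spike together for unrelated reasons.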

Disambiguating App Traffic Using User Agents

In this analysis, we discovered that many major apps set a string in the user agent that can be used to identify the app, even when the app doesn’t set a referrer. Facebook, BuzzFeed, Twitter, QQ, Baidu, and others all do this. By looking at this user agent string and using it to identify the referrer, we’re able to disambiguate a non-trivial portion of dark social traffic and correctly attribute it to specific mobile apps. We recently implemented this change, and if you happened to be looking closely at your dashboard around 6pm last night, you might have seen your m.facebook.com traffic jump up by 40% and your dark social fall by 5-10% when we flipped the switch. While this is only a small piece of the overall dark social share, it is a clear step in the right direction. As more apps take similar measures, this approach has the potential to help reverse the growth of the dark social problem.
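As a rough sketch of the idea, the fallback from referrer to user agent might look like the code below. The token table is a hypothetical example: the specific substrings and the sources they map to are assumptions chosen for illustration, not the list we actually use, and real apps may change their user agent strings at any time.

```python
# Hypothetical (token, attributed source) pairs; illustrative assumptions only.
APP_TOKENS = [
    ("FBAN/", "m.facebook.com"),
    ("FB_IAB/", "m.facebook.com"),
    ("Twitter for", "twitter.com"),
]


def attribute_visit(referrer, user_agent):
    """Prefer the referrer; otherwise fall back to known app tokens in the user agent."""
    if referrer:
        return referrer
    for token, source in APP_TOKENS:
        if token in user_agent:
            return source
    return "dark social"


# Made-up user agent strings for the example.
in_app = "Mozilla/5.0 (iPhone; CPU iPhone OS 8_1 like Mac OS X) [FBAN/FBIOS;FBAV/20.0]"
plain = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36"

facebook_app = attribute_visit("", in_app)
unknown = attribute_visit("", plain)
```

Visits that carry a referrer are left untouched, so this kind of rule only ever moves traffic out of the dark social bucket.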

Going Forward

As we get more data from the User Agent change, it will be interesting to see how much of the relationship between dark social and some of the major applications remains. Will the relationship between dark social and Facebook mobile traffic disappear? Well, probably not, because there will still be people who see a link on Facebook and then share it through text or email or other means.

Still, the general approach of looking into your articles’ traffic patterns is quite fruitful — you’re likely to be able to identify the source of dark social for specific stories if you choose to dive in (feel free to reach out if you’d like advice on how to do it using our historical APIs).

