Archive for December, 2013

Hannah Keiler was our Fall 2013 Data Science intern here at Chartbeat, working with Chief Data Scientist Josh Schwartz. Hannah is a senior at Columbia University, where she studies Statistics with a concentration in Computer Science. This blog post details one of several projects she tackled during her internship at Chartbeat.

At Chartbeat, we sometimes want to compare metrics across similar sites. There are several different ways to group sites. For example, you can begin by thinking about grouping sites by size – comparing metrics like number of readers or articles published each day. We were also interested in grouping together sites that write about similar content. Grouping sites by content manually for thousands of domains is incredibly tedious, so we wanted to devise a metric that would allow us to group similar sites automatically.

One way to define sites as having similar content is if they write on similar subjects at around the same time. If sites write about the same subjects, they are probably using the same key words, like “Obama” or “Syria.” We knew that the words that best summarize the content of an article are likely the words appearing in its headline. Keeping these ideas in mind, we developed our metric.

Computing Similarity

We start by comparing sites two at a time. Let’s call the sites A and B. We look at the words used in the headlines in A and B day by day.

For each day, we record the words used in both A and B and compute a weighted sum of their counts. That means that we divide the number of times a certain word occurs in both A and B in one day by a number indicating how often that word occurs in headlines in general. Weighting the word counts helps us to pick out two sites that write about niche topics by giving more weight to rarer words. For each day, we then sum up all of these values and then we sum up all of the values for all of the days. Let’s call this final sum “Value 1.”

We also record all of the words used in headlines by either A or B for each day.  Then for each day we compute a weighted sum of these word counts and then add up all the weighted sums from each of the days into one value. Let’s call this “Value 2.”

 Then we divide Value 1 by Value 2. We now have a ratio of sorts of the number of words A and B share versus the number of words they use in total.
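To make the steps above concrete, here is a minimal Python sketch of one way the ratio could be computed. It is an illustration, not Chartbeat's actual code. Several details are assumptions the post does not spell out: the shared count for a word is taken as the smaller of the two sites' daily counts, the "either site" count as the larger, and the weight as one divided by a general headline frequency supplied in a hypothetical `global_freq` dictionary. The tokenizer and the sample headlines are likewise made up.

```python
from collections import Counter

PUNCT = ".,:;!?\"'()"


def tokenize(headline):
    """Naive tokenizer (an assumption): lowercase words with edge punctuation removed."""
    words = (w.strip(PUNCT).lower() for w in headline.split())
    return [w for w in words if w]


def daily_counts(headlines):
    """Count how often each word appears across one site's headlines for one day."""
    counts = Counter()
    for headline in headlines:
        counts.update(tokenize(headline))
    return counts


def similarity(site_a, site_b, global_freq):
    """Return Value 1 / Value 2 for two sites.

    site_a, site_b: dict mapping a day label to that day's list of headlines.
    global_freq: dict mapping a word to how often it occurs in headlines in
    general (hypothetical input; unseen words default to 1).
    """
    value_1 = 0.0  # weighted counts of words used by both sites
    value_2 = 0.0  # weighted counts of words used by either site
    for day in set(site_a) | set(site_b):
        a = daily_counts(site_a.get(day, []))
        b = daily_counts(site_b.get(day, []))
        for word in set(a) | set(b):
            weight = 1.0 / global_freq.get(word, 1)  # rarer words weigh more
            value_1 += min(a[word], b[word]) * weight  # assumed "shared" count
            value_2 += max(a[word], b[word]) * weight  # assumed "either" count
    return value_1 / value_2 if value_2 else 0.0


# Made-up data, for illustration only.
site_a = {"2013-12-01": ["Obama speaks on Syria"]}
site_b = {"2013-12-01": ["Syria talks stall"]}
global_freq = {"obama": 50, "syria": 20, "speaks": 100,
               "on": 1000, "talks": 80, "stall": 10}

score = similarity(site_a, site_b, global_freq)
print(round(score, 3))
```

With these choices the score falls between 0 (no shared words) and 1 (identical word usage every day), which makes scores comparable across pairs of sites.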

How does this look?

We first computed the similarity metric for sites whose content we thought was geared towards sports, music, or celebrity/entertainment news. To visualize the similarity metric, we plotted the sites as nodes in a connected graph.



 FYI: These graphs are anonymized because we don’t share individual client data

The distance between the sites represents their similarity. Closer sites have a stronger similarity metric. On this graph, the sports sites are dark blue, the celebrity sites are red, and the music sites are teal. As you can see, sites with similar content group together! The fact that the celebrity sites are in the middle implies that they share some content with music and sports sites, which makes sense. The outlier posts fewer articles daily than the other celebrity news sites, so there was less overlap in term usage and, accordingly, the similarity metric was lower.

 We also tried out our metric with British and Australian news. We get the graph below.



Here, the UK sites in red group together and the Australian sites in teal group together. The outlier writes more niche news stories than general Australian news, so it had less overlap with the other Australian and British news sites.

Wrapping Up

These initial results show that sites that post articles with the same topics in the headlines at around the same time tend to be similar types of sites. Moving ahead, this could be a great way to group sites into different categories based on their content.

Phillip Smith is a digital publishing consultant who focuses on news innovation, specifically “the technology and ideas that are shaping how users interact with journalism online.” You can find him on Twitter and at

Phillip Smith Chartbeat

When I created my first Chartbeat account back on March 23, 2010, I didn’t anticipate that just a few short years later I’d be asked to host a talk at Chartbeat’s NYC HQ to explore how publishers are going “Beyond the Click” to get to actual engagement.

I also didn’t anticipate the impact of that event on my thinking about the nexus of publishers, the technology companies working in the publishing space, advertisers and media buyers, and — most importantly — the engaged users.

At the conclusion of the event I had a good hunch that this loose collection of ideas and initiatives, which fall under the heading of “new metrics for publishers” — happening very openly at companies like Chartbeat and Disqus, but also unfolding in newsrooms around the world — was likely to be a key theme in 2014, along with unprecedented technology innovation in newsrooms, the disruption of the existing advertising models,  and the changing demands of users.


Publishers take center stage on the Web

If you take it at face value, recent research tells us that roughly 78% of adult internet users in the U.S. go online to “get news.” Add to that some of the other main reasons that people go online (e.g., “to look for information about a service or product they are thinking of buying” or “to find information on a hobby or interest”) and you’ll find that online publishers are often providing that information too.

This puts publishers in a very enviable position, I would propose, as the producers, purveyors, curators, and gatekeepers of much of the Internet’s most sought-after commodity: fresh, timely, contextual, and relatively-objective information. One by one, publishers appear to have navigated their ships’ course to adapt to the changing landscape. They are focusing attention on a new, quickly evolving role as the central marketplace of information, attention, and engagement.

Newsrooms as software innovation labs

More interesting still is the relatively new trend of publishers investing in digital staff that are not stuck in the outdated role of “IT.” Working at a distance from those folks that are managing the servers and infrastructure, these new “news apps” and “good Internet” teams are embracing experimentation. They are pushing the envelope of what users have come to expect as “content” and “information.” The outcome is new forms of journalism and storytelling that invite the user to be more engaged, whether through data, interactivity, or awe.

One side effect of these investments is a quickly-maturing technical acumen in newsrooms and a wealth of contributions back to the open-source software community — the same community that helped to make much of the innovation possible in the first place. A natural feedback loop is born: more technical innovation and open-source contributions by newsrooms lures more talented developers away from other sectors, and more talented developers often means more innovation in those same newsrooms.

Another upside of this publisher-driven innovation is the pressure it exerts on the ecosystem of technology vendors they collaborate with, compelling those same vendors to evolve their products more rapidly and to open up new ways for programmers to work with their products.

Disruption of the display advertising model

Pressure is also being exerted on many of the players in the traditional online display advertising space, in large part due to the experiments that online publishers are undertaking.

For example, just one factor out of several — the increased consumption of online content on mobile devices — has meant that forward-thinking publishers have all but eliminated traditional display ad formats from their now “mobile first”  Web properties; the lonely “big box” display ad is almost all that remains in many recently launched “responsive” sites that aim to provide an all-in-one experience for readers arriving from phones, tablets, laptops, or desktops. The Boston Globe,, and NPR are good examples.

Other pressures, like low reader engagement with display advertising, and the complicated web of ad-delivery networks, slow ad-serving technologies, and the increasing practice of “programmatic buying” that sidesteps the publisher’s sales team, have led to a wave of publishers experimenting with new ad formats that fall under the term “native advertising” (or “sponsored content”), where the distinction between advertising and content blurs considerably. However, almost overnight, publishers have bootstrapped their own solutions to two challenges: ad formats that adapt well to the mobile reading experience, as well as, in many cases, increasing reader engagement. Some recent numbers suggest that as many as 3/4 of online publishers in the U.S. are now offering native advertising.

These shifts away from timidly accepting what technology vendors have to offer, or what advertising agencies are pitching, and toward producing in-house solutions to the challenge of increasing their readers’ satisfaction and engagement exemplify publishers pushing the envelope at a time when they’ve come to see the role they play as one of the convening places on the modern Web.

Users start to pay for content, but want privacy too

At the other end of these shifts, however, are some established technology companies making quick moves to address these new challenges head-on with their existing products. Chartbeat’s push to create a new metric for publishers, “Engaged Time,” is one great example (and they didn’t ask me to say that!). Thought leadership is coming from all directions, and it is often touching on this intersection of publishers, advertising, and users.

There is, generally speaking, a surge of analytics and “big data” technology coming onto the market that is aimed at helping online publishers make sense of their mountain of accumulated “user data.” A focus of many of these new analytics platforms is to help publishers understand their audience better, as well as which content performs the best, and publishers are increasingly in the position of making tough decisions about which visits matter to them most when they decide where to invest their editorial budget.

One voice is still often missing, however, or perhaps it is one billion voices: the end-user, the news consumer, the reader, the individuals who use the Web. This coming year, I predict, is going to be about the unfolding story of a new contract between readers, publishers, technology platforms, and advertisers.

The continued, and relatively successful, introduction of pay walls at mainstream news sites, experimental subscription strategies at the heart of a new breed of entrepreneurial journalism initiatives, and the continued success of “crowd funding” to help support expensive types of journalism all point in one direction: publishers will rely less on “selling users to advertisers” as the exclusive strategy for financial sustainability.

At the same time, however, users are growing leery of the ongoing revelations of the invasions of their privacy by the companies they once thought were infallible, and thus many are growing more reluctant to unwittingly hand over their information. Practices like programmatic ad buying, which make online tracking more directly evident to end-users, and efforts like Mozilla’s Lightbeam for Firefox, which shine a light on how much data is being shared, and with whom, will continue to push users to ask for more privacy from publishers, advertisers, and platforms.

The dynamics at play above (publishers, technology platforms, advertisers, and users) and the tensions being exerted as each one tries to optimize their online experience (whether it’s sharable content, ad delivery, better metrics, or more privacy) are why I’m predicting that “the ascendance of publishers and users” will be a key narrative of the Web in 2014.

What do you predict will happen in online publishing in 2014? Do you agree with Phillip’s predictions? Share your ideas in the Comments below.

For the final installment of our series on Understanding Your Traffic Sources, I wanted to go over some best practices for managing referral traffic and identify a few places where you can use Chartbeat data to support your decision-making.

But first, let’s sum up the data that we’ve seen over the past few weeks. The graphic below shows what sort of browsing behaviors are indicative of visitors coming back to your site, based on many sites’ most common traffic sources.


At one extreme, we have visitors who come directly to your site’s homepage and are always likely to return. At the other, those who come via Google News are unlikely to return, regardless of how they read. In the middle, though, we have an interesting split:

  • Visitors who come from Facebook are likely to read most of the article they land on, but those who click to a second article are much more likely to return

  • For visitors from Twitter and Google search, on the other hand, consuming the entire article they land on is the best indicator of a likelihood of returning

Traffic from other, smaller sources tends to behave much like Google News or Twitter traffic in this graphic. Now that we have a sense of how different kinds of referral traffic behave, I’ll dive right into what actions you can take with this data.

Where, and how, to concentrate your efforts

One of the starkest data points we’ve come across is how much more likely a person is to return to a site via the referrer they come from versus all other referrers combined. Those who come from Facebook are likely to return only via Facebook, those who come from Google News are likely to return only via Google News, and so on. In that sense, the most important thing you can do to grow audience from a given referrer is maintain a steady stream of links from that referrer.


Given that, you should ask two questions. First, what sources should we concentrate on building traffic from? Second, what can we do to build that traffic?

The best way to decide the former, if you’re a Chartbeat Publishing client, is to take a look at the “return rate” and “return direct rate” columns of your Weekly Perspectives. Those columns express, in essence, the value of links from different referrers — those with higher return rates send traffic that’s more likely to return to your site.

If you don’t have access to Chartbeat Publishing, the general trend that we’ve seen is that, unsurprisingly, visitors from social sources have the highest likelihood of returning, while sources like Google News, Reddit, and Outbrain are likely to increase your site’s reach by sending new visitors, but are unlikely to meaningfully help you grow your audience in a self-sustaining way.

The second question, of course, is much harder to answer in broad terms. Taking each traffic source one-by-one, though:

  • Twitter: One thing we’ve seen many times is that people don’t promote posts nearly as often on Twitter as they should. Most sites see the majority of their Twitter traffic coming from their own tweets, and the lifetime of a tweet is incredibly short. Tweeting headlines is rarely the right choice.

  • Facebook: Facebook traffic typically comes from organic sharing, which means it’s harder to predict and control. One thing you can control is Facebook’s preview text, and it’s hugely important. If you don’t know what text is showing up on Facebook’s previews, you need to figure it out.

  • In-network sites: If your site is part of a network, working to maintain links from your sister sites is critical. It’s not uncommon to see return rates over 50% (about twice as high as for typical referrers) for in-network traffic, which is a function both of similarity of audience and of the regularity of links. Fostering these types of link partnerships is one of the best ways to sustainably build audience.

  • Google: First off, it’s critical to separate “branded” search (searches for your domain name or URL) from truly organic search and Google News. Branded search should be thought of as akin to direct traffic. Optimization for organic search is a huge topic unto itself and probably beyond the scope of this post.

A caveat for paywall sites

One place where sites often miss out is with paywalls that are porous for traffic from external referrers, only presenting a prompt to subscribe on later pages. Under that scheme, a visitor, for instance, who always comes from Twitter and only reads the article she lands on will never even be asked to subscribe. We’ve seen some publishers move toward differentiated paywalls for exactly this reason: traffic from some referrers is immediately asked to log in while visitors from others are allowed to read an article or two for free.

If that fine-grained control isn’t in the cards, your goal should always be to get visitors to read through to a second article. Looking at “subsequent time” in your Weekly Perspectives should give you some idea of which referrers send visitors that are likely to click to a second page — concentrating on getting traffic from these referrers makes sense. And, understanding where people are leaving each article will give you a clue into where you should be placing link suggestions. Great related links at the top of an article aren’t in view for visitors who read the whole page, and great links at the bottom of an article don’t matter to those who never scroll down to see them.


Wrapping up

We’ve hardly scratched the surface of what can be said about traffic sources. Much of the most exciting data is easiest to find under the hood of your dashboard – the data that’s specific to your site, not the internet as a whole. We’re working on putting out several case studies that look in detail at traffic for a few sites, which we’ll be sure to let you know about here once they go live.

In the meantime, thanks for reading, and if I can leave you with one message it’s this: experiment!

What we’ve presented over the past five articles are broad statistics about traffic across the internet, but we regularly see sites that wildly depart from the average. If you see a return rate of 10% from a given referrer, take that as a challenge and try getting traffic to a different set of links from that referrer and see if you can push next week’s rate to 11%.

Let me know your questions or what you’re seeing in your data in the comments here or by tweeting at @joshuadschwartz; I’ll be sure to come back to your site if you get in touch.


startup job search

So you just graduated from college. Or you’re stuck in the library dreaming about money. Ok, you’re just wasting away at a company that you don’t give a damn about. It’s the perfect time to find work at a tech startup. Not a developer? Never written a line of code? No worries. I got your back.

The idea of working at a startup is appealing, for sure. But if you don’t know where to start, things could get ugly. Here are three surefire ways to find and get hired by an awesome tech startup.



Step 1: Remove the Weeds

According to one estimate, about 906,241 tech startups existed in the US as of 2010.

How the hell are you supposed to find a startup that is both building something exciting, and completely legitimate?

You know yourself better than I do. I won’t tell you what’s exciting (real-time data anyone?) and what isn’t. But once you’ve identified that, here are a few ways to pick a winner (and make a good first impression on your interviewers).

  1. Unsure of what you’re looking for? Search through portfolio companies of top Venture Capital firms to find startups that are well-funded and looking for someone with your distinct skillset.

  2. Know exactly what you want? Use CrunchBase, a powerful startup database, to find a company as specific as: a BioTech startup within 5 miles of Manhattan, founded after May 2000, with a maximum of 200 employees, and $1 million in funding (and other specific attributes).

  3. Willing to live in NYC and need to start ASAP? Check out Made in New York City to find all the currently hiring startups in NYC.



Step 2: Expand Your Circle

You have an idea of where you want to work, but you don’t have any connections within the startup realm. It’s time to go to tech meetups and find tech-literate people. You probably want to stop reading this post right about here – but hang on!

We’re all in agreement that forced-networking is the most agonizing experience a human being can endure. Don’t worry. I’m not telling you to spam an event with your business card.

Get out there and get acclimated to your new universe. You’ll get a sense of whether or not you’re a good fit. And, you will reduce your anxiety about interviewing at a big, bad tech company.


Step 3: Share YOUR Story

Speaking of interviews, when you’re sitting across the table from your future boss, she only really wants to know one thing. Can you fill the hole that’s plaguing the company?

Developers rely on a track record of building scalable systems or shipping web applications to prove their worth.

We (mere mortals) rely upon our ability to tell a succinct, compelling story. The chapters of your story are already there. It’s time for you to find the common thread, and sew that bad boy together. Your story, composed of your work and personal experiences, MUST align with the startup’s mission and requirements for the role.

Say, for example, you’re interviewing at a startup that wants to make life easier for small business owners, and needs someone with related experience.

When you’re asked to talk about your interest in the company, share your story of working at your mother’s clothing store every summer (if it’s true). Dig into the specific challenges she faced as a small business owner. Share the lessons that you can apply if you’re hired by this startup. Tie in how the startup’s product would have improved visibility for your mom’s store and increased sales.

It will break the ice, show your deep understanding of the product/mission, and force the interviewer to remember you. Who knows. It may even help you land the job.

So that’s all I’ve got for now. Feel free to sound off in the comment section and I’ll do all I can to help you land that job you’re working towards.

The goal of product interface design is to develop a product’s personality (the new Chartbeat Publishing is friendly, trustworthy, fast, modern), and tell a story (publishers can build and retain a loyal audience from our data). Everything about a product’s design relates back to its personality and story, including the visual style, the interaction design, and the language. This isn’t something we can get at the first go – it requires a lot of experimentation.

So at Chartbeat we have a process in place that allows us to rapidly try out ideas: not just at the design level, but at every stage of the project. We create a quick rough series of wireframes, and do the same for visual design sketches. Early on we prototype these designs in the browser (we love AngularJS over here), and even have a system that allows for quickly prototyping data in the backend (a custom Lua scriptable real-time backend). At every step of the way, we’re testing and tweaking to make sure that our choices support the personality and story we’re trying to portray.

In the middle of all this endless iterating, we reached a point where we’re ready for an initial stable release. While we will keep nurturing and iterating on this product, here are four major design improvements that are in the new Chartbeat Publishing today.

Improved signaling


A major goal in rebuilding Chartbeat Publishing was to further reduce the burden of interpretation on our users, i.e., make the product – not our already very busy clients – do more of the heavy-lifting. That goal was realized by a couple of different approaches, especially figuring out ways to answer common client questions of “what does this number mean?” and “how well is my site doing right now compared to at other times?”.

We expanded the product’s efficacy by using our technology not just to report numbers to you, but to interpret numbers for you, too. For instance, now when you mouse over your Engaged Time section in the dashboard, a tooltip tells you how strongly your site is performing as compared to the past month – are you “on par” or “over-performing” – along with your site’s monthly average Engaged Time and its maximum average Engaged Time. We go one step further by pointing out which articles are potentially responsible for either an over-performing or under-performing Engaged Time. And at all times we call out which articles should be regarded as “good for your site’s health.”

I talk a little more about the subject of signaling and invisible design on my own site.
Next up: see how the product’s signaling starts even before you start processing your dashboard data – thanks to our strategic use of color.

Color isn’t just color

Color is now a fundamental signaling element in the dashboard – we’re moving away from color as a legend – allowing you to interpret what’s happening in the dashboard (and thus on your site), even more quickly than before. If you see green, your brain automatically picks up that something positive is happening, whereas anything red implies that something is underperforming or past its prime. By using these simple cues, paired with a baseline blue palette, users can navigate the product even more efficiently without having to refer to a legend.

General affordance


Our first version of the new Chartbeat Publishing dashboard was powerful in many regards. But we noticed that some dashboard elements, particularly a few crucial features related to building your loyal audience, were hard to discover. This was definitely not the client’s fault – it’s up to Design and UX to figure out how to make these things easy for clients to find. In our case, this issue had more to do with the fact that we had learned even more about the industry after users started beta-testing our first version of the new product.

We decided to make it really easy to find all the different types of filters and sorts we made available. The UI for sorting on Engaged Time is now as prominent as we think it is useful. Filtering by something like what your “new” visitors are reading is really easy to find in the new dashboard. So you’re doing less guessing and more getting right to it. This easier-to-navigate interface allows us to expose more advanced features.

Things are more flexible than ever before


One of those features is multi-pivoting, which lets you combine different filters. In the original Chartbeat Publishing dashboard, you could click into a story and drill down into a particular story’s data. The new dashboard lets you pivot and manipulate the dashboard to pull out almost any specific data you want – whether it’s the number of mobile readers you have in Spokane, Washington, or which stories are attracting your most loyal visitors right now – thus increasing the number of actions you can take. Being able to do more within the dashboard allows you to surface the insights you need to inform your decisions and processes.

All in all

Our mission is to keep improving what we hope is an accessible dashboard that presents actionable insights. Right now we’re looking at how people are using the new Chartbeat Publishing, and we’ll take those learnings, along with whatever major needs arise in the industry, to inform the next batch of updates.