
My headphones are in, and I’m listening to Jóhann Jóhannsson’s The Miners’ Hymns — one of my favorite albums for coding. I’m finishing up an API for our new Heads Up Display (HUD), for which I’d worked out the math a few days earlier. I had spent the previous day figuring out how to implement the math and testing out edge cases with synthetic data, interspersed between product planning meetings and debugging a performance issue with a new component of the HUD backend. I’m about to put out a Pull Request when I take a look at Nagios and notice that one of the systems that powers the current HUD has just gone critical. I start to debug, and a second later I get a Slack message from someone on Chartcorps saying that customers are starting to notice that the HUD is down. I see that it is a simple fix this time; I just have to restart one of the services that powers the HUD.

Just in time, too, because I have to head uptown with members of our sales team to talk to one of our strategic clients about our new headline testing product. On my way out the door, one of our designers pulls me aside to look at the current designs for displaying the results of headline tests in the new HUD: “Does this viz accurately represent the data?” We talk for ten minutes, weighing pros and cons and looking at design alternatives. We talk about color schemes. We talk a bit about user interaction.

The meeting uptown goes wonderfully; I give a high-level overview of multi-armed bandit headline testing, answer some technical questions about the product, and get great feedback to take back to the team. When I get back to the office, I see a message from Lauryn Bennett, our Head of Brand, asking if any of us on the Data Science team have time to answer a request from a journalist about a news event that just happened. This particular request doesn’t require an in-depth statistical analysis, so I write a quick script to pull the numbers. I spend a bit of time looking at the results and then write up a few paragraphs describing what I’ve found. I then head into a meeting with fellow engineers, designers, and product owners to plan our next Sprint.

This is my typical day.


According to the Harvard Business Review, data science is the sexiest job of the 21st century. If you have data, you need a data scientist to get value from it; data scientists are the only ones who can wrangle #BigData into submission. Apparently, data science will save us all.

I’ve read many pieces over the past year trying to describe what data science actually is. There’s usually some talk about math and programming, machine learning, and A/B testing. Essentially these pieces boil down to one observation: data scientists do something with data. #DeepLearning anyone? I’ve followed arguments on Twitter and blogs about who should and should not be considered a data scientist. Is Data Science even a new discipline? How does it differ from Statistics? Programming? Or is it this…


Ok, then, what the hell does a data scientist actually do?

Now this is a question I can answer. And since I haven’t read many concise descriptions of what data scientists do day-to-day, I figured that I’d throw my hat into the ring and talk about the kind of data science we do here at Chartbeat.
“WARNING: it may be a bit different from what you’ve heard data scientists typically do for a living.”

OK, so, what exactly do you do?

Our team here at Chartbeat is made up of what I like to call Product-Centered Data Scientists — meaning the majority of what we do on a daily basis is in direct support of our products. Because we are a data company, our role is pretty central to the organization. Of course, we do math. We build data pipelines and write production code. We do all kinds of analyses. But we also work regularly with sales and marketing. We go on customer visits and help out with sales calls. We even participate in user research with our designers, UX, and product owners.


As a tech company, we build software products. Plain and simple. As a data company, every one of those products has a data science need. Because of this, our team is embedded within the engineering team, and most of us take on heavy backend or front-end roles in putting code into production. We don’t just hand prototypes over to engineering for them to implement. We do the implementation. We tune our Redshift clusters, find API performance bottlenecks, choose the proper data structures. We are also part of the backend on-call rotation. If Chartbeat were to break at 2AM, we’d help fix it.

For example, just consider our Engaged Headline Testing tool. Andy Chen and Chris Breaux have been instrumental in designing, building, and maintaining the systems that power headline testing. Andy worked out the initial math for adding Engaged Time into the multi-armed bandit framework and was one of two people who built the initial backend. Chris has since taken over the Data Science role on the team and continues to push the math, and the product, to new places. The new features that will be released soon in that product are — in no uncertain terms — data science features.
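To give a flavor of the kind of math involved, here is a minimal Thompson-sampling sketch of a bandit whose reward weights clicks by Engaged Time. To be clear, this is an illustrative toy under my own assumptions (the class, reward model, and 30-second cap are all made up for this example), not the actual math behind our headline testing product.

```python
import random

class EngagedHeadlineBandit:
    """Toy Thompson-sampling bandit for headline testing.

    Illustrative sketch only: the reward is a click weighted by engaged
    time, which is one simple way to fold engagement into a bandit. The
    names and the reward model are assumptions, not production code.
    """

    def __init__(self, headlines):
        # One Beta(alpha, beta) posterior per headline.
        self.stats = {h: {"alpha": 1.0, "beta": 1.0} for h in headlines}

    def choose(self):
        # Sample a plausible success rate for each headline and serve
        # the headline with the highest sampled rate.
        samples = {
            h: random.betavariate(s["alpha"], s["beta"])
            for h, s in self.stats.items()
        }
        return max(samples, key=samples.get)

    def update(self, headline, clicked, engaged_seconds, cap=30.0):
        # Treat an engaged click as a partial success: a click followed by
        # 30+ seconds of engagement counts fully, a quick bounce counts little.
        reward = (min(engaged_seconds, cap) / cap) if clicked else 0.0
        self.stats[headline]["alpha"] += reward
        self.stats[headline]["beta"] += 1.0 - reward


bandit = EngagedHeadlineBandit(["Headline A", "Headline B", "Headline C"])
shown = bandit.choose()
bandit.update(shown, clicked=True, engaged_seconds=12.0)
```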

In fact, all of us play central roles to each of the products with which we are associated. Josh Schwartz and Justin Mazur have built an enormous portion of our Ads Suite, Kris Harbold and Josh have built all of Report Builder, and Kris holds the distinction of being our only team member to have both front-end and backend code in production. Justin and I have worked on our Video Dashboard, and I’ve built a lot of the HUD. Each of us has contributed countless lines of code to all sorts of systems across Chartbeat.

“I don’t think it is an exaggeration for me to say that there is not a part of Chartbeat code that a data scientist has not touched.”


Okay, so we do math. This just comes with the territory. Sometimes sophisticated math, sometimes rote math. This math is either in direct support of a product or is part of an analysis we’re working on. We don’t do math every day, but when math is needed, we are there to answer the call.

Research + Analysis

Analysis is typically thought of as an essential skill of a data scientist, and we definitely do our fair share. These range from customer-specific reports to industry-wide studies to analyses that inform a specific product build. Take, for example, the analysis Chris Breaux did on Dark Social traffic, or the countless studies Josh Schwartz, our Chief Data Scientist, has published on our blog and elsewhere. Or take, for instance, the research that Justin and Chris recently did toward updating our Engaged Time measurement methodology, the work Kris and I published on user journeys through websites, or the work Jeiran Jahani is doing to break new ground in topic detection. If there is a question we can use our data to answer, we’ve likely been tasked with answering it. Sometimes our analyses take a few minutes; sometimes they take a few weeks. Sometimes we have to dig deep into our bag of tricks and pull out sophisticated statistical tools. Sometimes we write simple SQL queries to calculate averages.

User Interviews + Ethnographic Research

With our product designers and product managers, some of us on the data science team sit in on user interviews and do ethnographic research. I haven’t seen this as a common practice among data scientists at other organizations, but I think it is an incredibly important activity for a product data scientist to participate in.

I know a lot of data scientists and engineers who roll their eyes at this kind of stuff, but understanding user goals helps in the design of a data pipeline, the choice of an algorithm, or the decision about which metric is best for a given application. It makes you empathetic to your user base, which is never a useless endeavor. As product-centered data scientists, we try to keep in our heads at all times exactly what has to happen to create an amazing user experience.

“From the ugly, messy data at the start of the pipeline, to the user’s interaction with the tool, the user is interacting with data, and that has to be in our purview.”

These interviews also give context for where you can be lax with assumptions, because you often have to make trade-offs when you try to implement your fancy models. Sometimes all that great math adds one second to the response time of an API, and when you have traffic like ours, sometimes you can’t afford one second. Knowing the fidelity that your users expect or require helps solve this problem.

When we were redesigning the HUD, I sat in a variety of newsrooms with one of our designers and watched editors work. We simply watched them use our product in their day-to-day flow, and asked questions now and again about what they were doing. I also sat in a few user interviews during this time and have since sat in on countless others. Those experiences have influenced the engineering and data design of the HUD, as well as several other products I’ve helped build. And now, I can’t imagine being part of a product build without having done at least some user research.

Ideation + Future Products

Product-centered data science is not all about maintaining current systems or developing feature increments. There is also a large amount of long-term vision thinking. What will our products look like next year? In the next five years? Often, our team will prototype systems to test the feasibility of an idea or a product direction. We comb through the latest data science papers and computer science literature to see if any of the latest findings can be applied to future (or current) products. Once every six weeks, we set aside a week for the entire team to work on data-specific projects that aren’t directly connected to current product work. We’ve built some cool stuff — a system that scrapes and searches content across our network, a tool that discovers popular stories in the news, a deep recurrent neural net to predict traffic, a Slackbot recommendation engine — you name it.

Sales + Marketing

Not only do we help design and build the products, but we do what we can to help sell them, too. We’ll often pull customer-specific numbers or industry benchmarks, or even put together full-on reports for the sales team to use on sales calls. Sometimes we’ll even sit in on those calls and other client visits. We write blog posts and our Data Science Quarterly, which help the marketing team grow our customer base. We write product white papers. We give interviews to reporters. Basically, we speak on behalf of Chartbeat Data to whoever needs to hear from us.

Product-Centered Data Science

This is product-centered data science — at least here at Chartbeat. Personally, I think every product team should have a data scientist on it. Data science is about storytelling, and so are product design, sales, and marketing. There are so many intersections in thinking that it just seems natural for us to be involved in all these parts of the business. I might be in the minority, but for me, data science really has nothing to do with #BigData. It has nothing to do with machine learning. It might not have anything to do with statistics. It is about asking questions, developing user empathy, creating an experience, and telling a story. Our medium is data, our medium is code, but the outcomes are fantastic product experiences.

We’re always looking for great storytellers: whether data scientists, account managers, or backend engineers. Come join us.

You’ve got the quality content. You’ve got the attention of your audience. You’ve got the metrics to prove it. But how do you go about monetizing that attention?

One viable answer to that question lies in Engaged Ad Refresh. By loading new ads into positions that have already been actively exposed to readers for a significant amount of time, Engaged Ad Refresh creates additional premium ad inventory to sell to advertisers. In other words: high quality content rewarded by commensurate ad revenue.


What We Built

If you had access to tons and tons of data, what would you build? How would you use it? What would you analyze? As a Data Science Intern this summer, I availed myself of Chartbeat’s unparalleled access to content data and built something really, really exciting — we call it: Media Monitor.

Media Monitor explores the variation in political news coverage across the United States. It’s a tool that identifies and displays news topics receiving disproportionate coverage in each of the 50 states, relative to the rest of the country. Or — select a presidential candidate, and it will show you the topics most frequently associated with that candidate in each state.


How We Built It

Take One: Fun With Memory Errors

The data we have on news content – the information that powers Media Monitor – is stored in Elasticsearch, a document store optimized for in-depth search. With Elasticsearch, you can explore statistical properties of documents, most notably through a “significant terms” query. This query lets you define a category of documents, called a “foreground set” – for example, “all the articles published in New York” or “articles containing the word ‘immigration’” – and then tells you which terms are disproportionately common among documents in that category relative to all the other documents (the “background set”).
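For readers who haven’t used it, a significant terms query is just a search with a `significant_terms` aggregation attached. A rough sketch of the “immigration” foreground query described below might look like this (the index name `articles` and field name `content` are illustrative, not our actual schema):

```python
import json
import requests

# Foreground set: articles containing the word "immigration".
# The significant_terms aggregation asks which terms are
# disproportionately common in that set versus all other documents.
query = {
    "query": {"match": {"content": "immigration"}},
    "size": 0,  # skip the individual hits; we only want the aggregation
    "aggs": {
        "notable_terms": {
            "significant_terms": {"field": "content", "size": 20}
        }
    },
}

resp = requests.post(
    "http://localhost:9200/articles/_search",
    headers={"Content-Type": "application/json"},
    data=json.dumps(query),
)

for bucket in resp.json()["aggregations"]["notable_terms"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```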

This is exactly what we wanted Media Monitor to do: identify those political news topics that received more media attention in a particular state than other states AND those topics associated with one candidate more than other candidates.

For example, all the major presidential candidates have made statements about the economy, but whose message is earning media attention? Or more macroscopically, how do states differ in their coverage of different candidates? To see how Elasticsearch’s significant terms could help us answer these questions, I constructed a significant terms query against the complete content of all the articles in our Elasticsearch cluster, using “articles containing the word ‘immigration’” as the foreground set. I sent the query to our smaller test cluster…

And waited.

And waited some more.

And finally checked the health of the test cluster, and realized that the whole thing had crashed.

After digging through log files for a while, I figured out that the significant terms query had run out of memory; it crashed the entire test Elasticsearch cluster.
Our data set was just too big – not because it contained too many documents, but because it had too many unique words.

To facilitate searching, Elasticsearch stores what it calls an “inverted index,” a dictionary that maps each word to a list of all the documents that contain that word. That way, if I search for “immigration,” all it has to do is look up “immigration” in the inverted index and return to me the list of documents it already has stored right there. However, when Elasticsearch receives a significant terms query, it loads the entire inverted index into memory to do the math it uses to determine which terms are significant. If the inverted index is too big, the cluster overruns its memory and all hell breaks loose. No matter how simple or complex the parameters of our significant terms query, there was no way we could successfully run it on the complete content – with all the unique words – of all our articles.
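Conceptually, an inverted index is nothing more than a mapping from each term to the documents containing it. A toy version in a few lines of Python (the documents here are obviously made up) shows both why lookups are cheap and why the number of unique terms is what hurts:

```python
from collections import defaultdict

docs = {
    1: "immigration reform stalls in the senate",
    2: "new york debates immigration policy",
    3: "college football rankings released",
}

# Build a toy inverted index: term -> set of document ids.
inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        inverted_index[term].add(doc_id)

# A search for "immigration" is now a single dictionary lookup...
print(inverted_index["immigration"])  # {1, 2}

# ...but a significant terms query has to reason about *every* term,
# which is why millions of unique terms blow up memory.
print(len(inverted_index))
```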

Take Two: Defining “Relevancy”

We currently have about 15 million documents stored in our Elasticsearch cluster, and that doesn’t cause any problems. Elasticsearch is designed to handle large numbers of documents easily.

Among those 15 million documents, we have upwards of two million unique terms. Our benchmarking suggested, however, that a successful significant terms query would need ten thousand or fewer unique terms in the inverted index.

Fortunately, of those two million terms, the vast majority are not very useful. Many are stopwords — words like “and” or “the” that appear extremely frequently and carry almost no information. Some are words in languages other than English, which we decided to discard because most discussion of the American political process occurs in English. Many words in that index are highly specific to news topics other than politics (e.g., sports, music, or food), while others are political, but don’t tell us much on their own (like the word “politics” itself).

Since we already process the content of each article before adding it to Elasticsearch (to get word count, language, and so on), we could easily extract only the words from each article belonging to some smaller, more informative set, and store them alongside the full content. This would let us run a significant terms query just on this collection of relevant words, and we could pick the list of relevant words to keep in advance to make sure it contained fewer than ten thousand terms. I just needed to figure out how to decide which words were relevant.

As a first attempt, I examined the list of articles tagged as “politics” in our dataset and selected words from these articles in two ways. First, I used a “named entity extractor” to pull out names of people, places, and organizations. Useful topics like “immigration” or “economy” aren’t named entities, though, so to capture those, I tried a simple thresholding system: throw away terms that appear too many times in the corpus of politics articles (mostly stopwords) or too few times (mostly misspellings) and keep the words in between those two thresholds, which are the most likely to be informative.
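A sketch of that two-part vocabulary construction is below, assuming spaCy for the named entity extraction and simple document-frequency cutoffs. The thresholds and function names are placeholders for illustration, not the values or tooling we actually used:

```python
from collections import Counter

import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed


def build_vocab(politics_articles, min_count=20, max_count=5000):
    """Combine named entities with mid-frequency terms.

    Thresholds are illustrative: terms appearing too often are mostly
    stopwords, terms appearing too rarely are mostly misspellings.
    """
    entities = set()
    doc_frequency = Counter()

    for text in politics_articles:
        doc = nlp(text)
        # Named entities: people, places, organizations, etc.
        entities.update(ent.text.lower() for ent in doc.ents)
        # Count each term once per article (document frequency).
        doc_frequency.update({tok.text.lower() for tok in doc if tok.is_alpha})

    mid_frequency = {
        term
        for term, count in doc_frequency.items()
        if min_count <= count <= max_count
    }
    return entities | mid_frequency
```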

Combining these two lists gave me roughly five thousand terms – a much more manageable number. I modified our data pipeline to keep words from the combined list and store them separately from the full article content in Elasticsearch. Within a few days, enough new articles had been processed this way to see the results. I ran another significant terms query, using “articles published in New York” as my foreground set, but this time against the filtered version of the articles rather than the complete article content. I sent the query to the test cluster, and this time the results came back almost immediately.
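The second query had essentially the same shape as the first, just aggregated over the smaller pre-filtered field rather than the full content. Roughly (again, the index and field names such as `relevant_terms` and `state` are hypothetical):

```python
import json
import requests

# Foreground set: articles published in New York, compared against the
# whole corpus, aggregating over the pre-filtered word list instead of
# the full article content.
query = {
    "query": {"match": {"state": "new york"}},
    "size": 0,
    "aggs": {
        "notable_terms": {
            "significant_terms": {"field": "relevant_terms", "size": 10}
        }
    },
}

resp = requests.post(
    "http://localhost:9200/articles/_search",
    headers={"Content-Type": "application/json"},
    data=json.dumps(query),
)
print([b["key"] for b in resp.json()["aggregations"]["notable_terms"]["buckets"]])
```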

But then I looked at the actual results. The top five terms for articles from New York were “new york,” “manhattan,” “brooklyn,” “albany,” and “de blasio.” It was clear that, mathematically speaking, the significant terms query was doing exactly as advertised – we would certainly expect there to be more articles about Manhattan in New York than in other states – but this kind of obvious association wasn’t really what we were looking for.
Our wordlist contained far too many obviously location-specific terms (mostly places and names of local politicians) that would outweigh any interesting regional differences by default.

Take Three: If You Want Something Done Right…

Clearly we needed to remove the unhelpful local terms from our wordlist if we wanted to see anything other than a list of places in each state, sometimes mixed with the names of its governor, senators, representatives, or prominent mayors. The problem is that even though a named entity extractor can identify which words refer to places and which refer to people, it doesn’t know which places are in the United States, and it doesn’t know which people are local politicians. We couldn’t just discard all places (Iran, for example, is certainly politically relevant) or all people (our tool wouldn’t be very useful if it couldn’t give any information about the current presidential candidates). No — this would have to be done by hand.

I spent several hours combing through the wordlist, removing the names of US states and cities, current or recent state politicians (excluding those currently running for president), and more political but non-distinguishing terms (like “treaty” or “govern”) that wouldn’t mean much if you saw them listed among a state’s most significant topics. I also added some useful terms that had been omitted for various reasons (largely because they became time-sensitive topics only after I generated the original wordlist), like “planned parenthood” and “sandra bland.” The result was a much smaller list of only about 600 words. I put the new list in place and tried again.

Finally, with this new, heavily hand-curated wordlist, significant terms queries started to give us results that made sense and actually conveyed some interesting and useful information.

News outlets in bluer states were noticeably less likely to have “benghazi” among their top terms associated with Hillary Clinton than outlets in redder states. Both “immigration” and “undocumented” are among the top terms associated with Donald Trump in essentially every state (with “megyn kelly” appearing alongside him in several states), while Jeb Bush isn’t strongly associated with any particular topics across the board – his top terms in most states are either lists of other Republican candidates, generic acknowledgements that he’s running (“presidential,” “2016”), or topics that the state is highly concerned with independent of him (e.g., “immigration” appears in Arizona, Texas, and Florida with or without a Jeb Bush filter). Chris Christie is mentioned more in New Jersey than in any other state, including New York and California, which both have many more total articles in our dataset. Iowa has, at this point, a much larger number of articles mentioning each of the presidential candidates by name, relative to the total number of articles published in the state, than any other state. The topic of “violence” is strongly associated with “gun control” and “second amendment” in New York, with topics like “ferguson,” “naacp,” “police,” and “protest” in much of the South, and with “syria,” “yemen,” and “gaza” in California.

Some of these results are unsurprising, and it’s interesting to see the data confirm our intuition; others – particularly the major differences in focus between what sorts of violence are being covered in different states – reveal something novel.

There were some major trade-offs in creating Media Monitor: because the wordlist had to be examined and altered by hand, we don’t yet have a good sustainable way to keep it updated as new topics become relevant in the news. And because Media Monitor operates off a limited wordlist, users can’t search whatever topics they want (although hopefully we’ve done a pretty good job of making sure the most important topics are included). However, in the end, we’ve managed to put together something that can provide some meaningful insight into discrepancies in political news coverage – and I learned a huge amount in the process.

ICYMI: Last week, our Chartcorps and Data Science teams hosted a Q&A webinar covering the latest mobile research from our spring Quarterly issue. You can check out a full recording of the webinar (or read the Cliff Notes version) here.

You asked, we answered.

Attendees asked lots of great questions during last week’s mobile research findings webinar—some of which we didn’t have a chance to answer during the live Q&A session:

What are some strategies for increasing pageviews per user on mobile devices?

Whether you’re promoting your site on social media or deciding which articles to put on a landing page, consider sending traffic to pages with high mobile recirculation, which tend to drive readers further into your site.

Filter on the “mobile” platform in the Chartbeat Publishing Dashboard and sort your Top Pages section by Recirculation to quickly see which articles are doing the best at driving your audience deeper into your site.

To increase mobile recirculation on a particular page, take a look at where mobile visitors scroll to before leaving, either in real-time with the Heads Up Display or historically with Report Builder.

How can digital publishers tailor content better for mobile platforms?

With mobile platforms, the best way to optimize for your audience is to make sure that the most viewable part of the page—just below the digital fold—is set up for success. Are you giving your audience a chance to move on to additional content before that first big drop off? And if they do drop off after that point, are they at least leaving with a strong takeaway from the article?

If you’re constantly tracking your mobile audience, you’ve probably also noticed trends in terms of which referrers typically send traffic via mobile devices, and even what time of day you usually see the highest amount of mobile traffic. These insights can all inform you on when and where to promote your mobile stories.

Track the second-by-second, pixel-by-pixel attention of your audience with our Editorial Dashboard.

Have you noticed a difference in mobile consumption with the increased popularity of “phablets” — larger-screen phones like the iPhone 6 Plus or the Galaxy Note 4?

Right now we break down devices into three categories—desktop, mobile, and tablet—so we haven’t looked at any trends yet in the various kinds of mobile devices, but we’ll keep it in mind.

How do you measure scroll depth in the case of these studies?

To measure maximum scroll depth, we look at the furthest point a user scrolled on the page, as tracked by our pinger, a piece of Chartbeat JavaScript that runs in your browser. Note that maximum scroll depth isn’t necessarily the point at which you left the page, although the two are often the same number.

To get the maximum scroll depth data for the Data Science Quarterly, we used the same data store that feeds our Report Builder.

What kind of correlation have you seen between scroll depth on landing pages and pageviews per visit?

We haven’t done a study on this correlation yet, but if you’re using the Report Builder tool you can build recurring reports tracking pageviews and unique visitors for a specific page.

Set metrics to “pageviews” and “unique cookies” and set a page path filter equal to “/” — the path of your homepage — and filter on device type equals mobile. That way you can keep track of the numbers for your own homepage day in and day out and discover any trends that might be unique to your own audience. (For more tips and tricks check out Report Builder 101).

Learn more about our suite of editorial tools here. Questions? Shoot us a note or check out our Chartcorps Support Site.

Did you miss last week’s webinar on mobile findings from the new Data Science Quarterly? Saw it and want to review the fundamentals? No fear. You can watch Andy and Chris here.

Want the quick scoop? Here’s a recap of their discussion:

During the webinar, Chris from Chartcorps and Andy from our Data Science team walked through the latest research from our new Data Science Quarterly to help you better understand the data and leverage the insights we’ve found. In particular, they talked about viewability and user behavior in a mobile context. Some questions they considered:

  • What parts of the browser page are most viewable?
  • What is the relationship between scroll depth and viewability?
  • How are mobile and desktop viewing experiences different?


The most viewable part of the page for mobile browsing is just below the digital fold, but for desktop browsing it’s above the fold. Nevertheless, the majority of users exit near the fold on both platforms.

For mobile, we see much greater scroll depth than on desktop. At certain points on the page, there are pronounced peaks in active users. Andy explained this pattern in terms of scrolling methods: on phones, users tend to scroll in discrete chunks, whereas on desktops, scrolling is much smoother. As a result, mobile users tend to scroll much further down the page.

Because of the unique nature of mobile scrolling, ads that must be reached by scrolling a few times see an uptick in viewability.

That said, it’s important to note that scroll depth is a proxy for viewability—a user might scroll to the bottom of the page so quickly that the ads do not qualify as viewable.


When you understand how your audience consumes content, you can better adapt to create a more engaging experience. If, for example, you know where on the page reader engagement is coming to a halt, you can adjust elements on the page or rework your actual content to keep your audience actively reading.

These are the kinds of insights and tips we prioritized when building the new Heads Up Display. Chris closed the webinar with a sneak peek of the new Heads Up Display and all its features—many of which are mobile-focused. In real time, you can analyze what mobile users see, how they interact with the page, and where they are spending time on the page.

By pairing real-time metrics like Engaged Time or scroll depth with historical data from our Report Builder tool, you can instantly act on opportunities—and learn from those actions to build out future strategy.


Since you know that scroll depth is strongly linked to viewability on mobile devices, you can build historical reports to evaluate how your mobile pages perform on those metrics. Using Report Builder, you can see a ranking of your most scrollable articles: create a one-time report using a “device type equals mobile” filter, select “scroll starts” and “average scroll” as your metrics, and group by “page path”.

You could also investigate a specific article’s mobile performance by setting its page path as a filter. That way you can track it day in and day out to see whether any changes you’re making to the article’s layout affect users’ scroll depth. (For more sample reports, check out our Report Builder 101 post.)

For the whole shebang, including more historical reports, watch the full webinar.

We also had some great follow-up questions from our webinar attendees. Check out the answers from Andy and Chris here.

Questions about the data? Shoot Andy an email. Want to know more about our Heads Up Display or Report Builder tools? Get in touch.