Shaun’s Hack Week Recap: Step 1, Tackling the Data

April 16th, 2013 by Shaun

Since my last blog post, things have been moving quickly over here. Hack Week was invigorating (i.e., hectic) as usual, and while I didn’t get to demo my hack because I was out sick, I did make a ton of progress on building a reader that's based on what I'm now calling the Shaun Appreciation Score: a measure built from a story's Average Engaged Time, average scroll depth, and content height.

Measuring appreciation

The Shaun Appreciation Score is my attempt to measure how much time users are spending within the actual content (the story itself), and not just on the page. I calculated my score by sampling our existing data from thousands of articles across our publishers' sites*, and by writing a system to collect new data that we don’t currently have available to us. This includes scraping content from these pages, determining where the content starts and stops, and then figuring out how much time it takes to consume that content.

Once I collected this massive set of data, I then chatted with Josh, one of our awesome Chartbeat data scientists. He suggested that I start plotting some of this data to get a feel for what the distribution might look like. So I calculated the mean and standard deviation of all the key data points I wanted to measure: average scroll depth, Average Engaged Time, and content height (the physical length of a story on a page).
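For the curious, here's a minimal sketch of that summary step in Python. The field names, the units (seconds and pixels), and the sample values are all made up for illustration; this is not the actual Hack Week code.

```python
from statistics import mean, pstdev

# Hypothetical sample rows. The values are invented for illustration:
# engaged_time in seconds, scroll_depth and content_height in pixels.
articles = [
    {"engaged_time": 20.0, "scroll_depth": 1000.0, "content_height": 2000.0},
    {"engaged_time": 40.0, "scroll_depth": 1500.0, "content_height": 3000.0},
    {"engaged_time": 60.0, "scroll_depth": 2000.0, "content_height": 4000.0},
]

FIELDS = ("engaged_time", "scroll_depth", "content_height")


def summarize(rows, field):
    """Return (mean, population standard deviation) for one field."""
    values = [row[field] for row in rows]
    return mean(values), pstdev(values)


# One (mean, standard deviation) pair per data point of interest.
summary = {field: summarize(articles, field) for field in FIELDS}

for field, (avg, spread) in summary.items():
    print(f"{field}: mean={avg:.1f}, stdev={spread:.1f}")
```

I used the population standard deviation (`pstdev`) here; if the articles are treated as a sample of a larger set, `statistics.stdev` would be the one to reach for instead.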

The beauty of a bell curve

After these calculations, I plotted this data to see if there was any correlation between Engaged Time and scroll depth. As it turns out, there is a strong correlation(!). Content that has above-average Engaged Time generally has higher average scroll depth as well, but that’s not the whole picture. I wanted to see how much time was spent within the content, not in the comments section at the bottom, or in the slideshow at the top of the page. There wasn't an easy way to determine this, so I decided that taking the Engaged Time relative to the content length would help weed out articles with fewer words, or users who spend a ton of time in the comments section.
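As a rough sketch of those two ideas in Python: a plain Pearson correlation between two lists of numbers, and Engaged Time divided by content height. Both function names are hypothetical, and the simple ratio is only one possible reading of "Engaged Time relative to the content length", not the exact score formula.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists.

    Raises ZeroDivisionError if either list has zero variance.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length (at least 2 items)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def time_per_height(engaged_time, content_height):
    """Engaged Time relative to content length (assumed: seconds per pixel)."""
    if content_height <= 0:
        raise ValueError("content_height must be positive")
    return engaged_time / content_height


# Made-up numbers, for illustration only.
engaged = [20.0, 35.0, 40.0, 65.0]
scroll = [900.0, 1400.0, 1500.0, 2300.0]
print(pearson(engaged, scroll))
print(time_per_height(60.0, 3000.0))
```

Dividing by content height means a long story doesn't get credit just for being long: two pages with the same Engaged Time end up with different values if one has twice as much content to get through.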

Plotting the data in a histogram looked something like this:

[Image: histogram of the plotted data]

This bell curve is good news. I got something that resembles a normal distribution, which suggests that my data quality is good, and that there is in fact a difference between what I’m calling "good" content appreciation and "bad" content appreciation.

More to do next Hack Week

While I didn't finish building my reader within this past Hack Week, I did get to a point where I'm feeling pretty good about my results so far. I built an API endpoint that will spit out the top 5 and bottom 5 pages ranked by my appreciation score, and this is enough to build a rough prototype UI on top of it. Hopefully next Hack Week I can revisit this application and finish up my ambitious goal of building what's effectively a quality-based reader.
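The ranking behind an endpoint like that could look something like the sketch below. It assumes the scores are already computed and stored in a dict keyed by page path; the function name, the paths, and the score values are hypothetical, and the HTTP layer is left out entirely.

```python
def top_and_bottom(scores, n=5):
    """Return the n highest- and n lowest-scoring pages.

    `scores` maps a page path to its appreciation score. "top" is ordered
    best first, "bottom" is ordered worst first.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return {
        "top": ranked[:n],
        "bottom": ranked[-n:][::-1],
    }


# Made-up pages and scores, for illustration only.
scores = {f"/story-{i}": float(i) for i in range(12)}
result = top_and_bottom(scores)
print(result["top"])
print(result["bottom"])
```

With fewer than ten pages, the two lists will overlap, which a real endpoint would probably want to handle.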

What do you think my reader's UI should look like?

  *While we experiment with the data available from our API, our clients’ data is kept private and never shared publicly.