Links We Like: Data Engineering

Jerry

12 years ago

At Chartbeat we’re currently pushing to build out many new products. All of our work is focused on data. My part in that is ensuring our infrastructure is prepared to handle all of the data that we need to push through it. At peak on an average day, our servers handle about 130K requests per second. Even a new product that is, essentially, “just looking at the data in a different way” can require a lot of engineering.

Our real-time focus has let us get away with keeping only aggregated data for historical look-back. As products evolve, the data we retain will need to grow to a larger and larger fraction of the full signal that we capture, which in turn will need a storage engine to handle it. Redshift is one of the tools we’re looking at. A couple of recent posts have helped us out quite a bit while forging into new territory.

Shout out to the Aggregate Knowledge Tech Blog with their post, AWS Redshift: How Amazon Changed The Game. It’s inspiring to see how thoroughly they approach engineering problems.
It’s also nice to look over the shoulders of our neighbors in New York. Bitly has started integrating Redshift into their tech stack, and their post, Speeding things up with Redshift, was a timely read.
Speaking of NYC Tech, one of the big stories of 2013 is Yahoo!’s acquisition of Tumblr. The scale of data moving through any social network’s infrastructure is enough to make the best engineering teams pause to reflect. High Scalability posted a fascinating inventory of what Tumblr has been building: The Tumblr Architecture Yahoo Bought For A Cool Billion Dollars.

As capturing and storing data becomes easier, simple engineering tasks around that data become harder. Take counting. Basic algorithms to count the number of distinct records in a dataset (that is, assuming the dataset has repeats) have been around for decades, but the simple ones assume that the set of distinct items will fit in memory on one machine. That assumption is routinely broken at today’s data volumes. Recently I’ve been brushing up on the ideas around probabilistic counting.

A nice intro (with technical details and visualizations!) is again from Aggregate Knowledge, Sketch of the Day: Probabilistic Counting with Stochastic Averaging (PCSA). Admittedly I happened on this after reading their Redshift post.
Reading up on probabilistic algorithms prompted me to revisit the post on probabilistic data structures at Ilya Katsov’s blog, Highly Scalable.