Links We Like: Data Engineering

July 17th, 2013 by Jerry

At Chartbeat we're currently pushing to build out many new products. All of our work is focused on data. My part in that is ensuring our infrastructure is prepared to handle all of the data that we need to push through it. At peak on an average day, our servers handle about 130K requests per second. Even a new product that is, essentially, "just looking at the data in a different way" can require a lot of engineering.

Our real-time focus has let us get away with keeping only aggregated data for historical look-back. As products evolve, what we keep will need to grow to a larger and larger fraction of the full signal that we capture, which in turn will need a storage engine to handle it. Redshift is one of the tools we're looking at. A couple of recent posts have helped us out quite a bit while forging into new territory.

As capturing and storing data becomes easier, simple engineering tasks around that data become harder. Take counting. Basic algorithms to count the number of distinct records in a dataset (assuming, that is, that the dataset has repeats) have been around for decades, but the simple ones assume that the set of distinct items will fit in memory on one machine. At today's data volumes, that assumption is routinely broken. Recently I've been brushing up on the ideas around probabilistic counting.
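To make the counting problem concrete, here is a minimal Python sketch that contrasts the simple exact approach with one probabilistic estimator. It uses K-Minimum Values (KMV), which is just one of several probabilistic counting techniques and is picked here for brevity; it is not a description of what we run. The function names, the choice of SHA-1 as the hash, and `k=1024` are all assumptions made for illustration.

```python
import hashlib
import heapq


def hash01(item: str) -> float:
    """Map an item to a pseudo-uniform value in [0, 1] (SHA-1 is an illustrative choice)."""
    digest = hashlib.sha1(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def exact_distinct(items) -> int:
    """The simple approach: memory grows with the number of distinct items."""
    return len(set(items))


def kmv_distinct(items, k: int = 1024) -> int:
    """Estimate the distinct count while remembering only the k smallest hash values."""
    heap = []        # negated hash values, so heap[0] is the largest one we keep
    members = set()  # hash values currently in the heap, used to skip repeats
    for item in items:
        h = hash01(item)
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            # New value is smaller than the largest kept value: swap it in.
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        # Fewer than k distinct hashes seen, so the count is (effectively) exact.
        return len(heap)
    kth_smallest = -heap[0]
    # If n uniform values land in [0, 1], the k-th smallest sits near k / n.
    return int((k - 1) / kth_smallest)


if __name__ == "__main__":
    # Hypothetical stream: 200,000 events from 50,000 distinct users.
    stream = [f"user-{i % 50_000}" for i in range(200_000)]
    print("exact:   ", exact_distinct(stream))
    print("estimate:", kmv_distinct(stream))
```

The exact version has to hold every distinct item, while the KMV version holds at most `k` hash values no matter how large the stream gets. The price is accuracy: the relative error shrinks roughly as one over the square root of `k`, so a larger `k` buys a tighter estimate at the cost of more memory.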