The newsroom in the digital age is undergoing a seismic transformation. The channels through which news is gathered and distributed diversify continually, and the pace at which news can be – and is expected to be – delivered is always accelerating. In the face of these changes, survival is quite simply the main objective for many of the stakeholders: journalists, editors, and publishers.
The words "data" and "newsroom" spoken together conjure a data-driven approach to publishing: metrics and analytics. Chartbeat's primary focus has been on real-time actions – giving decision makers the tools they need, precisely when they need them. I'll confess that I spend a lot of time working on data capture for analytics and metrics at this granular, high-frequency time scale. My larger interest, however, is data science as, well, a science. Another association the words "data" and "newsroom" should conjure is data-driven journalism: using digital tools to gather and analyze information the way a scientist would.
To clarify a little, investigative journalism has always been about collecting data to find and tell a story. Here I'm talking about the more recent development of "the journalist as hacker." The sources invariably involve digital repositories, databases, or all-out web scraping. Data dumps are analyzed using spreadsheets, visualization tools like Tableau, or, eventually, a programming language like Python.
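To make that last step concrete, here's a minimal sketch of the kind of analysis that outgrows a spreadsheet: tallying and ranking records from a data dump in Python. The CSV columns and values here are invented for illustration; a real dump would be read from a downloaded file rather than an inline string.

```python
import csv
import io
from collections import Counter

# Stand-in for a downloaded data dump; in practice you'd use
# open("dump.csv") on the file you scraped or FOIA'd.
dump = io.StringIO(
    "donor,city,amount\n"
    "Acme Corp,Springfield,5000\n"
    "Acme Corp,Springfield,2500\n"
    "Jane Doe,Shelbyville,150\n"
)

# Sum contributions per donor -- the pivot-table step, in three lines.
totals = Counter()
for row in csv.DictReader(dump):
    totals[row["donor"]] += float(row["amount"])

# Largest donors first, the ranking a journalist would scan for leads.
for donor, total in totals.most_common():
    print(f"{donor}: {total:.0f}")
```

The payoff over a spreadsheet is repeatability: when a corrected dump arrives, you rerun the script instead of redoing the pivot tables by hand.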
These are the things that are traditionally not taught in J-school. Journalists who take this route to get the story are learning things on their own. Fortunately they are also sharing what they learn through blogs and conferences. It's an inspiring development to watch from the perspective of a more traditional data scientist, and I'd like to share some of my reading list.
I'll start off with what's closest to home. The New York Times has embraced a scientific approach to journalism to the extent of starting an R&D Lab. In lieu of long-form posts, their Twitter feed is a great way to stay current on what they're up to, and a recent alumna blogged about her experience as a summer intern with the lab.
The output of the lab that readers are most likely to see is the great visualizations The Times consistently produces. Equally strong in this area is The Guardian. Their data blog is a way to keep current. Around the time of its inception they even wrote an article on Data journalism at the Guardian.
ProPublica is another great place to witness data journalism bloom, both with their Tools & Data page and their Nerd Blog. Like The Times and The Guardian, they often build news applications, and a recent post discusses good software practices in the newsroom. They take geek to the next level by open sourcing many of their tools on their GitHub page!
The methods from data science and software development aren't traditionally taught in journalism school, but this may be changing. The Tow Center for Digital Journalism at Columbia University is offering a course they call Computational Journalism. The syllabus has broad overlap with the technologies behind data-driven web apps -- particularly social apps -- and the suggested readings present a large cross-section of the state of data journalism today.
The site Data Driven Journalism, as the name implies, pulls many resources together, from tools for the data journalist in the field, to featured projects and interactive visualizations, to conference announcements. A product of MozFest 2011 was the Data Journalism Handbook, published by O'Reilly but also distributed for free in web format. It's currently being translated and distributed in other languages.
From the past year, a couple of projects stand out for their impact as investigative pieces as well as their scale as data-driven applications. First is the ICIJ's Secrecy for Sale project, in which they built an interactive database exposing links between offshore tax havens and the otherwise legitimate corporations and wealthy individuals who rely on them. This was a collaborative effort among 86 journalists, and they have written extensively about the techniques they used to collate leaked databases. Another project is Dollars for Docs by ProPublica, an interactive database exposing secret payments that pharmaceutical companies make to health care professionals.
As a data scientist and software engineer myself, I'll admit that I have a tendency to focus on tools and analyses. In the end, however, the aim is to use data to tell a story, and nowhere is that truer than in journalism. Quoting from the ProPublica blog, "The most important step in building a news app is thinking about the story... Humans don't parse data well, even if it looks pretty on a screen. A journalist's job is to take the data, put it into context, and explain why the data is newsworthy to begin with."