Data journalism
Data journalism or data-driven journalism is journalism based on the filtering and analysis of large data sets for the purpose of creating or elevating a news story.
Data journalism reflects the increased role of numerical data in the production and distribution of information in the digital era. It involves a blending of journalism with other fields such as data visualization, computer science, and statistics, "an overlapping set of competencies drawn from disparate fields".
The term data journalism has been widely used to unite several related concepts and link them to journalism. Some see these as levels or stages leading from the simpler to the more complex uses of new technologies in the journalistic process.
Many data-driven stories begin with newly available resources such as open source software, open access publishing and open data, while others are products of public records requests or leaked materials. This approach to journalism builds on older practices, most notably on computer-assisted reporting, a label used mainly in the US for decades. A related label is "precision journalism", derived from a 1972 book by Philip Meyer in which he advocated the use of social science techniques in researching stories. Data-driven journalism has a wider approach: at its core, the process builds on the growing availability of open data that is freely available online and analyzed with open source tools. Data-driven journalism strives to reach new levels of service for the public, helping the general public or specific groups or individuals to understand patterns and make decisions based on the findings. As such, data-driven journalism might help to put journalists into a role relevant to society in a new way.
Telling stories based on the data is the primary goal. The findings from data can be transformed into any form of journalistic writing. Visualizations can be used to create a clear understanding of a complex situation. Furthermore, elements of storytelling can be used to illustrate what the findings actually mean, from the perspective of someone who is affected by a development. This connection between data and story can be viewed as a "new arc" that tries to bridge the gap between developments that are relevant but poorly understood and stories that are verifiable, trustworthy, relevant and easy to remember.
Definitions
Veglis and Bratsas defined data journalism as "the process of extracting useful information from data, writing articles based on the information, and embedding visualizations in the articles that help readers understand the significance of the story or allow them to pinpoint data that relate to them". Antonopoulos and Karyotakis define the practice of data journalism as "a way of enhancing reporting and news writing with the use and examination of statistics in order to provide a deeper insight into a news story and to highlight relevant data. One trend in the digital era of journalism has been to disseminate information to the public via interactive online content through data visualization tools such as tables, graphs, maps, infographics, microsites, and visual worlds. The in-depth examination of such data sets can lead to more concrete results and observations regarding timely topics of interest. In addition, data journalism may reveal hidden issues that seemingly were not a priority in the news coverage".
According to architect and multimedia journalist Mirko Lorenz, data-driven journalism is primarily a workflow that consists of the following elements: digging deep into data by scraping, cleansing and structuring it, filtering by mining for specific information, visualizing and making a story. This process can be extended to provide results that cater to individual interests and the broader public.
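A minimal sketch of the scrape-cleanse-filter portion of such a workflow, using the open-source pandas library, might look like the following. The data source, column names and threshold are hypothetical placeholders, not taken from any real publication.

```python
import pandas as pd

# "Dig deep into data": load a (hypothetical) open-data file published online
spending = pd.read_csv("https://example.org/open-data/city_spending.csv")

# "Cleanse and structure": drop incomplete rows and normalise the amount column
spending = spending.dropna(subset=["department", "amount"])
spending["amount"] = pd.to_numeric(spending["amount"], errors="coerce")

# "Filter by mining for specific information": keep only large payments
large_payments = spending[spending["amount"] > 100_000]

# A possible starting point for a story: which departments account for them?
print(large_payments.groupby("department")["amount"].sum().sort_values(ascending=False))
```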
Data journalism trainer and writer Paul Bradshaw describes the process of data-driven journalism in a similar manner: data must be found, which may require specialized skills such as working with MySQL or Python; then interrogated, for which an understanding of jargon and statistics is necessary; and finally visualized and mashed up with the aid of open-source tools.
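The "interrogate, then visualize" steps Bradshaw describes could be sketched, under the same hypothetical data assumptions as above, with the open-source pandas and matplotlib libraries:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical cleaned data file produced by the earlier step
spending = pd.read_csv("city_spending_clean.csv")

# Interrogate: simple descriptive statistics per department
summary = spending.groupby("department")["amount"].agg(["count", "median", "sum"])
print(summary.sort_values("sum", ascending=False).head(10))

# Visualize: a bar chart of total spending for the top ten departments
summary["sum"].sort_values(ascending=False).head(10).plot(kind="barh")
plt.xlabel("Total spending")
plt.title("Top departments by total spending (hypothetical data)")
plt.tight_layout()
plt.savefig("spending_by_department.png")
```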
A more results-driven definition comes from data reporter and web strategist Henk van Ess: "Data-driven journalism enables reporters to tell untold stories, find new angles or complete stories via a workflow of finding, processing and presenting significant amounts of data with or without open tools." Van Ess claims that some data-driven workflows lead to products that "are not in orbit with the laws of good story telling" because the result emphasizes showing the problem rather than explaining it. A good data-driven production, he argues, has different layers: it allows readers to drill down to the details that are personally relevant to them, while also enabling them to zoom out to get the big picture.
In 2013, Van Ess came up with a shorter definition that does not involve visualization per se: "Data journalism can be based on any data that has to be processed first with tools before a relevant story is possible. It doesn't include visualization per se."
However, one of the problems for defining data journalism is that many definitions are not clear enough and focus on describing the computational methods of optimization, analysis, and visualization of information.
Emergence as a concept
The term "data journalism" was coined by political commentator Ben Wattenberg through his work starting in the mid-1960s layering narrative with statistics to support the theory that the United States had entered a golden age.One of the earliest examples of using computers with journalism dates back to a 1952 endeavor by CBS to use a mainframe computer to predict the outcome of the presidential election, but it wasn't until 1967 that using computers for data analysis began to be more widely adopted.
Working for the Detroit Free Press at the time, Philip Meyer used a mainframe to improve reporting on the riots spreading throughout the city. With a new precedent set for data analysis in journalism, Meyer collaborated with Donald Barlett and James Steele to look at sentencing patterns in Philadelphia during the 1970s. Meyer later wrote a book titled Precision Journalism that advocated the use of these techniques, combining data analysis with journalism.
Toward the end of the 1980s, significant events began to occur that helped to formally organize the field of computer-assisted reporting. Investigative reporter Bill Dedman of The Atlanta Journal-Constitution won a Pulitzer Prize in 1989 for The Color of Money, his 1988 series of stories using CAR techniques to analyze racial discrimination by banks and other mortgage lenders in middle-income black neighborhoods. The National Institute for Computer-Assisted Reporting (NICAR) was formed at the Missouri School of Journalism in collaboration with Investigative Reporters and Editors. The first conference dedicated to CAR was organized by NICAR in conjunction with James Brown at Indiana University and held in 1990. The NICAR conference has been held annually since then and is now the single largest gathering of data journalists.
Although data journalism has been used informally by practitioners of computer-assisted reporting for decades, the first recorded use by a major news organization is The Guardian, which launched its Datablog in March 2009. Although the paternity of the term is disputed, it has been widely used since WikiLeaks' Afghan War documents leak in July 2010.
The Guardian's coverage of the war logs took advantage of free data visualization tools such as Google Fusion Tables, another common aspect of data journalism. Simon Rogers, editor of The Guardian Datablog, describes data journalism in his book Facts Are Sacred.
Investigative data journalism combines the field of data journalism with investigative reporting. An example of investigative data journalism is the analysis of large amounts of textual or financial data. Investigative data journalism can also relate to the field of big data analytics for the processing of large data sets.
Since the introduction of the concept, a number of media companies have created "data teams" which develop visualizations for newsrooms. Notable examples include teams at Reuters, ProPublica, and La Nación. In Europe, The Guardian and Berliner Morgenpost have very productive teams, as well as public broadcasters.
As projects like the MP expenses scandal and the 2013 release of the "offshore leaks" demonstrate, data-driven journalism can assume an investigative role, occasionally dealing with "not-so-open", i.e. secret, data.
The annual Data Journalism Awards recognize outstanding reporting in the field of data journalism, and numerous Pulitzer Prizes in recent years have been awarded to data-driven storytelling, including the 2018 Pulitzer Prize for International Reporting and the 2017 Pulitzer Prize for Public Service.
Taxonomies
Many scholars have proposed different taxonomies of data journalism projects. Megan Knight suggested a taxonomy based on the level of interpretation and analysis needed to produce a data journalism project. Specifically, the taxonomy included: number pullquote, static map, lists and timelines, table, graphs and charts, dynamic map, textual analysis, and infographics. Simon Rogers proposed five types of data journalism projects: By just the facts, Data-based news stories, Local data telling stories, Analysis and background, and Deep dive investigations. Martha Kang discussed seven types of data stories, namely: Narrate change over time, Start big and drill down, Start small and zoom out, Highlight contrasts, Explore the intersection, Dissect the factors, and Profile the outliers.
Veglis and Bratsas proposed another taxonomy based on the method of presenting the information to the audience. Their taxonomy had a hierarchical structure and included the following types: data journalism articles with just numbers, with tables, and with visualizations. In the case of stories with interactive visualizations, they proposed three distinct types, namely transmissional, consultational, and conversational.