Today, matching data sets never
intended to be matched is standard
fare in newsrooms; stories highlighting child-care providers with felony records and voting rolls populated with
the names of the dead are examples
of the genre. The technique requires
painstaking data cleansing and verification to deal with ambiguous identities and errors. Public records almost
never include Social Security numbers,
dates of birth, or other markers that
would provide more accurate joins.
But traditional news organizations
have been willing to devote their time
because they view documenting the
failure of government regulation, unintended consequences of programs,
and influence-peddling as core elements of their public-service mission.
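To make concrete the matching problem described above, here is a minimal sketch of how a reporter might flag candidate name matches between two record sets that share no identifier. It assumes names are the only common field and uses Python's standard-library difflib for approximate string comparison. The function names, sample names, and 0.85 threshold are illustrative assumptions, not a method described in this article, and every flagged pair would still need manual verification before publication.

```python
from difflib import SequenceMatcher
import re


def normalize(name):
    """Lowercase, drop punctuation, and sort the name tokens so that
    'Smith, John A.' and 'John A Smith' normalize to the same string."""
    tokens = re.sub(r"[^a-z\s]", " ", name.lower()).split()
    return " ".join(sorted(tokens))


def candidate_matches(left, right, threshold=0.85):
    """Compare every name in `left` with every name in `right` and return
    (left_name, right_name, score) tuples scoring at or above `threshold`,
    highest score first. The threshold is an arbitrary starting point."""
    pairs = []
    for a in left:
        for b in right:
            score = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
            if score >= threshold:
                pairs.append((a, b, round(score, 2)))
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Hypothetical sample data: a licensing list and a court-records list.
providers = ["Smith, John A.", "Maria Lopez"]
court_records = ["John A Smith", "Marie Lopes", "Robert Jones"]

for match in candidate_matches(providers, court_records):
    print(match)
```

A high score only says that two names look alike; without a date of birth or other marker, two different people can share a name, which is why the cleansing and verification steps remain the expensive part of the work.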
However, such public-affairs reporting is increasingly at risk due to the decline in revenue and reporting staff in
traditional news organizations, and
this is where the field of computational
journalism can help the most. By developing techniques, methods, and
user interfaces for exploring the new
landscape of information, computer
scientists can help discover, verify, and
even publish new public-interest stories at lower cost. Some of this work
requires developing brand-new technology, much of it involves building
new user interfaces for existing methods, and some calls only for simple repurposing.
Technologies and algorithms already
developed for informatics, medicine,
law, security, and intelligence operations, the social and physical sciences,
and the digital humanities all promise
to be exceptionally useful in public-affairs and investigative reporting. At
the same time, coupling the promised
increased availability of government
information with easy-to-use interfaces can aid nonprofessional citizen-journalists, non-governmental organizations, and public-interest groups in
their own news gathering.
Understanding News Data
For computationalists and journalists
to work together to create a new generation
of reporting methods, each needs
an understanding of how the other
views “data.” Like intelligence and
law-enforcement analysts, reporters focus
on administrative records and collections
of far-flung original documents
rather than anonymous or aggregated
organized data sets. Structured databases
of public records (such as campaign
contributions, farm-subsidy
payments, and housing inspections)
generate leads and provide context,
sometimes documenting wrongdoing
or unintended consequences of government
regulation or programs. But
most news stories depend as much or
more on collections of public and internal
agency documents, audio and
video recordings of government proceedings,
handwritten forms, recorded
interviews, and reporters’ notes collected
piece-by-piece from widely disparate
sources. Some (such as press
releases and published reports) are
born digital; others are censored and
scanned to images before being released
to the public.