ings, and civil- and criminal-court trials. Unless a third party has already
transcribed, closed-captioned, or applied speech-recognition techniques
on the record, most reporters have no
way to search even a rough transcript.
In addition, reporters record
many of their interviews digitally but
rarely have useful speech-recognition
software to index them. Basic consumer software products (such as Dragon-speech from Nuance) work on simple,
short recordings or trained voices. Other promising projects (such as Google’s
Audio Indexing, or GAUDi) are not publicly available. GPS and voice recognition on mobile phones and in voice mail
could lead reporters to believe that solving
their problem is simple.
Reporters could make near-daily
use of technology and a user interface
that would provide approximate indexing of a variety of voices and conditions, leading them to the portions
they most want to review. They do not
require the accuracy of, say, e-discovery by lawyers or official government
records. Instead, they want a quick way
to move to the portion of a recording
that contains what may be of interest,
then carefully review and transcribe it.
Existing technology is probably adequate for reporters’ immediate needs,
but we have been unable to find reasonably
simple user interfaces that would allow unsophisticated
users to test it on their
own recordings.
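The kind of approximate indexing described above can be illustrated with a minimal sketch. It assumes a speech recognizer has already produced a rough transcript as (start time in seconds, word) pairs; that input format, the function name find_passages, and the sample words are hypothetical, chosen only for illustration. Fuzzy matching stands in for the tolerance a reporter needs when the recognizer misspells a word.

```python
from difflib import SequenceMatcher


def find_passages(words, query, threshold=0.8, padding=5.0):
    """Return (start, end) time windows around words that resemble `query`.

    `words` is a list of (start_seconds, word) pairs in time order, a
    hypothetical format for a rough machine transcript.
    """
    query = query.lower()
    windows = []
    for start, word in words:
        # Similarity ratio in [0, 1]; tolerates recognizer misspellings.
        score = SequenceMatcher(None, word.lower(), query).ratio()
        if score >= threshold:
            lo, hi = max(0.0, start - padding), start + padding
            # Merge with the previous window when the two overlap.
            if windows and lo <= windows[-1][1]:
                windows[-1] = (windows[-1][0], hi)
            else:
                windows.append((lo, hi))
    return windows


# Invented sample: the recognizer misheard "budget" once as "budjet".
rough_transcript = [
    (12.0, "the"), (12.5, "budjet"), (13.0, "vote"),
    (15.0, "was"), (95.0, "budget"), (96.0, "cuts"),
]

for start, end in find_passages(rough_transcript, "budget"):
    print(f"review {start:.1f}s to {end:.1f}s")
```

The output is a short list of time ranges to listen to, not a transcript; the reporter would still review and transcribe those portions by hand.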
Extracting data from forms and reports. Much of the information collected
by reporters arrives in two genres: original forms submitted to or created
by government agencies, often handwritten, and reports generated from larger
systems, sometimes electronically and sometimes on paper. Examples include
financial disclosure statements of elected officials, death certificates,
safety inspections, sign-in sheets at government checkpoints, and police
incident reports.
Journalists have few choices today:
retype key documents into a database;
attempt to search recognized images;
or simply read them and take notes.
An in-house programmer can occasionally find a pattern in digital reports
intended for printing and use it to convert them back into a structured
database, but this time-consuming job requires skill well beyond that of
nearly all reporters.
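As a rough illustration of what such a programmer does, the sketch below turns a report formatted for printing back into structured rows. The report layout, the column names, and the sample records are all invented for this example; a real report would need its own pattern, which is where most of the effort goes.

```python
import re

# Invented sample of a print-formatted report with repeated page headers.
REPORT = """\
INSPECTION REPORT                         PAGE 1
DATE        FACILITY              RESULT
2010-03-02  Main St. Diner        PASS
2010-03-05  Harbor Grill          FAIL

INSPECTION REPORT                         PAGE 2
DATE        FACILITY              RESULT
2010-03-09  Elm Cafe              PASS
"""

# A data row starts with a date and ends with a result code; the facility
# name is whatever sits between them, set off by two or more spaces.
ROW = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})\s+"
    r"(?P<facility>.+?)\s{2,}"
    r"(?P<result>PASS|FAIL)\s*$"
)


def parse_report(text):
    """Keep only lines that look like data rows; skip headers and blanks."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            rows.append(match.groupdict())
    return rows


records = parse_report(REPORT)
for record in records:
    print(record)
```

Each resulting dictionary could then be loaded into a spreadsheet or database table. The approach breaks as soon as the layout varies, for example when a long name wraps onto a second line, which is one reason the job is time-consuming.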
New Tools, New Organizations
A handful of new services have emerged
to help address journalism’s data challenges. Usually free for small-scale or
non-commercial use, they facilitate
analysis, visualization, and presentation of structured data: Google Refine
promises to let reporters scrap their
spreadsheets for filtering, viewing, and
cleaning basic data sets; ManyEyes
from IBM lets news organizations visualize and share data on their Web sites;
Tableau Public from Tableau Software,
Google Earth, and other such products
are routinely used by news organizations to generate and publish visualizations. New tools (such as TimeFlow
developed at Duke University as an
investigative tool for temporal analysis) are being created to address some
longstanding needs of reporters.4
Another set of tools created for
other purposes, often experimental
or academic, shows promise for the
fast-paced, ad hoc nature of reporting
challenges. Several political-science
scholars have created tools for clustering legislation and other public documents; homeland-security developers
have created tools (such as Georgia
Tech’s Jigsaw15) for visualizing the connections among documents; and the
CMU Sphinx project3 has created reasonably accurate open source speech-recognition technology. Applications
developed for intelligence, law enforcement, and fraud investigations by such
companies as Palantir Technologies
and I2 are expensive and finely tuned
to specific industries, though they address similar challenges on a different
scale and with different requirements
for speed and accuracy.
DocumentCloud (http://www.documentcloud.org), a nonprofit founded
in 2009 by journalists at the New York Times and ProPublica, hopes to
address one of the most vexing issues in documents reporting: scanned
image files. With it, reporters can annotate their documents as they find
interesting or questionable sections and see which entities appear in
multiple documents. At this writing, most news organizations have used it
most effectively to publish government documents. But the project, which
includes information extraction as a standard feature, shows great promise
in helping address some of the problems of digesting large document
collections.
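The idea of seeing which entities appear in multiple documents can be sketched in a few lines. This is not DocumentCloud's implementation; it assumes entity extraction has already been done by some other tool, and the function name shared_entities, the document identifiers, and the entity names are hypothetical.

```python
from collections import defaultdict


def shared_entities(doc_entities, min_docs=2):
    """Map each entity to the documents mentioning it, keeping only
    entities found in at least `min_docs` documents.

    `doc_entities` maps a document id to the entity names already
    extracted from that document (extraction itself is assumed).
    """
    index = defaultdict(set)
    for doc_id, entities in doc_entities.items():
        for entity in entities:
            # Normalize lightly so "Acme Corp" and "acme corp " match.
            index[entity.strip().lower()].add(doc_id)
    return {
        entity: sorted(doc_ids)
        for entity, doc_ids in index.items()
        if len(doc_ids) >= min_docs
    }


# Invented example input.
extracted = {
    "contract.pdf": ["Acme Corp", "J. Smith"],
    "audit.pdf": ["acme corp", "City Council"],
    "memo.pdf": ["City Council", "Acme Corp "],
}

overlaps = shared_entities(extracted)
for entity, docs in overlaps.items():
    print(entity, "->", docs)
```

Real collections would need far more careful name matching than lowercasing, since the same person or company is often spelled several ways across documents.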