their competitors and clients; and citizens who want to know everything their
elected officials have done in the past
week. Search and retrieval technology
may be up to this task, but the existing
free and open source user interfaces to
the technology remain crude and fail
to address the variety of sources where
finding answers to queries is less important than exploring what’s new.
Information extraction. Most information collected by journalists arrives
as unstructured text, but most of their
work involves reporting on people and
places. A beat reporter might cover one
or more counties, a subject, an industry, or a group of agencies.
Most of the documents they obtain
would benefit from entity extraction.
Thomson Reuters allows the public to
use its OpenCalais service (http://www.
opencalais.com/), and at least a half-dozen open source and academic entity-extraction tools have been available
for several years. The intelligence community and corporations depend on
this basic but relatively new technique.
But effective use of these tools requires computational knowledge beyond that of most reporters; documents that are already organized, recognized, and formatted; or an investment in commercial tools typically beyond the reach of news outlets for non-mission-critical functions.
Being able to analyze and visualize
interactions among entities within and
even outside a document collection—
whether from online sources or boxes
of scanned paper—would give stories
more depth, reduce the cost of reporting, and expand the potential for new
stories and new leads.
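To make the idea concrete, here is a deliberately naive sketch of what entity extraction takes in and gives back. Real services such as OpenCalais rely on statistical models and large gazetteers; this standard-library toy, which merely collects capitalized multi-word phrases, is an illustrative assumption and not how any of the tools mentioned above actually work.

```python
import re
from collections import Counter

def extract_candidate_entities(text):
    """Naively collect capitalized multi-word phrases as candidate entities.

    This is a toy stand-in for real entity extraction: it only shows the
    input/output shape (text in, entity counts out), not production accuracy.
    """
    # Two or more consecutive capitalized words (e.g., "Elena Kagan").
    pattern = r"\b(?:[A-Z][a-z]+ )+[A-Z][a-z]+\b"
    return Counter(re.findall(pattern, text))

# Hypothetical snippet of a document a reporter might process.
doc = ("Supreme Court nominee Elena Kagan served in the White House. "
       "Elena Kagan later joined the Supreme Court.")
print(extract_candidate_entities(doc))
```

Even this crude pass yields the kind of people-and-places index a beat reporter could scan, which suggests why the far more capable commercial and academic tools are worth making accessible.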
Document exploration and redundancy.
There are two areas—finding what’s
new and mining accumulated documents—in which the ability to group
documents in interesting ways would
immediately reduce the time and effort required.
Audiences, editors, and producers
expect reporters to know what has been
published on their beats in real time.
Reporters need to notice information
that is not commonly known but that
could lead to news in interviews, documents, and other published sources.
The recent explosion in blogs, aggregated news sites, and special-interest-group compilations of information
makes distinguishing new stories time-consuming and difficult. Collections
of RSS feeds might comprise hundreds
of stories with the same information.
In our interviews with journalists,
we were told this challenge is more
difficult than it seems for reporters
lacking technical knowledge. But solving it would immediately reduce the
amount of time spent distinguishing
“commodity news,” or news widely
known and therefore uninteresting,
from news their audience might not
know or items that could prompt further reporting.
Another scenario arises in the collections of documents and data accumulated in a long investigative project.
In some cases, existing search tools are
not robust enough to find the patterns
journalists might seek. For example, in
2006, reporters at the New York Times
used more than 500 different queries
to find earmarks for religious groups.
In other cases, simply exploring a
collection of documents might suggest
further work if grouping them would
help identify patterns. For example,
in June 2010, the William J. Clinton
Presidential Library released more than 75,000 pages of memoranda,
email messages, and other documents
related to Supreme Court nominee
Elena Kagan. Grouping them in various ways might better identify her
interests, political leanings, and areas
where she disagreed with others in the
White House and suggest stories that
could be missed simply by reading and
searching the collection.
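The kind of grouping imagined above can be sketched with term weighting and similarity alone. The following is a minimal standard-library illustration, assuming tf-idf vectors and a greedy single-pass grouping with an arbitrary similarity threshold; the sample memo titles are invented, and a real system over 75,000 pages would need proper clustering and preprocessing.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Bag-of-words tf-idf vectors, computed with the standard library only."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(word for words in tokenized for word in set(words))
    n = len(docs)
    vectors = []
    for words in tokenized:
        tf = Counter(words)
        vectors.append({w: tf[w] * math.log((1 + n) / (1 + df[w])) for w in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def group(docs, threshold=0.2):
    """Greedy single-pass grouping: each document joins the first
    sufficiently similar cluster seed, else starts a new cluster."""
    vectors = tfidf_vectors(docs)
    clusters = []  # list of (seed_vector, [document indices])
    for i, v in enumerate(vectors):
        for seed, members in clusters:
            if cosine(v, seed) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((v, [i]))
    return [members for _, members in clusters]

# Invented stand-ins for documents in a collection like the Kagan release.
memos = [
    "memo on campaign finance reform",
    "email on campaign finance reform strategy",
    "notes on military recruiting policy",
]
print(group(memos))
```

Even this crude pass separates a campaign-finance thread from an unrelated policy memo, hinting at the topical structure a reporter could then read selectively rather than page by page.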
Combining these projects—
content aggregation, entity extraction,
and clustering of documents—could
provide breakthrough innovation in
investigative reporting. Together, they
would directly address the key problem
faced by most news consumers, as well
as by producers: too much material too
difficult to obtain containing too little
information. These advances might allow for efficient, effective monitoring
of powerful institutions and people
and reduce the mind-numbing repetition and searching that in-depth reporting often requires.
Audio and video indexing. Public records increasingly consist of audio and
video recordings, often presented as
archived Webcasts, including government proceedings, testimony, hear-