Surveying a suite of algorithms that offer a
solution to managing large document archives.
By David M. Blei
As our collective knowledge continues to be
digitized and stored—in the form of news, blogs, Web
pages, scientific articles, books, images, sound, video,
and social networks—it becomes more difficult to
find and discover what we are looking for. We need
new computational tools to help organize, search, and
understand these vast amounts of information.
Right now, we work with online information using
two main tools—search and links. We type keywords
into a search engine and find a set of documents
related to them. We look at the documents in that
set, possibly navigating to other linked documents.
This is a powerful way of interacting with our online
archive, but something is missing.
Imagine searching and exploring documents
based on the themes that run through them. We might
“zoom in” and “zoom out” to find specific or broader
themes; we might look at how those themes changed
through time or how they are connected to each other.
Rather than finding documents through keyword
search alone, we might first find the theme that we
are interested in, and then examine the documents
related to that theme.
For example, consider using themes
to explore the complete history of the
New York Times. At a broad level, some
of the themes might correspond to
the sections of the newspaper—
foreign policy, national affairs, sports.
We could zoom in on a theme of interest, such as foreign policy, to reveal
various aspects of it—Chinese foreign
policy, the conflict in the Middle East,
the U.S.’s relationship with Russia. We
could then navigate through time to
reveal how these specific themes have
changed, tracking, for example, the
changes in the conflict in the Middle
East over the last 50 years. And, in all of
this exploration, we would be pointed
to the original articles relevant to the
themes. The thematic structure would
be a new kind of window through which
to explore and digest the collection.
But we do not interact with electronic archives in this way. While more
and more texts are available online, we
simply do not have the human power
to read and study them to provide the
kind of browsing experience described
above. To this end, machine learning
researchers have developed
probabilistic topic modeling, a suite of algorithms
that aim to discover and annotate large
archives of documents with thematic
information. Topic modeling algorithms are statistical methods that analyze the words of the original texts to
discover the themes that run through
them, how those themes are connected
to each other, and how they change over
Key insights:

- Topic models are algorithms for discovering the main themes that pervade a large and otherwise unstructured collection of documents. Topic models can organize the collection according to the discovered themes.
- Topic modeling algorithms can be applied to massive collections of documents. Recent advances in this field allow us to analyze streaming collections, like you might find from a Web API.
- Topic modeling algorithms can be adapted to many kinds of data. Among other applications, they have been used to find patterns in genetic data, images, and social networks.