Learning Topic Models —
Provably and Efficiently
By Sanjeev Arora, Rong Ge, Yoni Halpern, David Mimno, Ankur Moitra, David Sontag, Yichen Wu, and Michael Zhu
Today, we have both the blessing and the curse of being overloaded with information. Never before has text been more
important to how we communicate, or more easily available. But massive text streams far outstrip anyone’s ability
to read. We need automated tools that can help make sense
of their thematic structure, and find strands of meaning
that connect documents, all without human supervision.
Such methods can also help us organize and navigate large
text corpora. Popular tools for this task range from Latent
Semantic Analysis (LSA),8 which uses standard linear algebra, to deep learning, which relies on non-convex optimization.
This paper concerns topic modeling, which posits a simple probabilistic model of how a document is generated. We give a formal description of the generative model at the end of the section, but first we outline its important features.
Topic modeling represents each document as a bag of
words whereby all notions of grammar and syntax are discarded, and each document is associated with its vector of
word counts. The central assumption is that there is a fixed
set of topics—numbering, say, a couple hundred—that are
shared and recur in different proportions in each document. For example, a news article about legislation related
to retirement accounts might be represented as a mixture of
0.7 of the topic politics and 0.3 of the topic personal finance.
Furthermore, each topic induces a distribution on words in
the vocabulary. Note that a word like account can occur in
several topics: it could refer to a financial product (a bank
account) or a story (a fictional account), but the probability assigned to it would likely vary across topics. Finally,
the model specifies that each document is generated by first
picking its topic proportions from some distribution, and
then sampling each word from the document-specific distribution on words. In the above example, each word would
be picked independently from politics with probability 0.7
and from personal finance with probability 0.3. The goal in
topic modeling is, when given a large enough collection of
documents, to discover the underlying set of topics used to
generate them. Moreover, we want algorithms that are both
fast and accurate.
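To make the generative process concrete, here is a minimal sketch in Python. The two-topic matrix, the tiny vocabulary, and the word probabilities are purely illustrative (they are not drawn from any real corpus), but the sampling steps mirror the description above: mix the topic distributions according to the document's proportions, then draw each word independently.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical toy vocabulary and topics; each row of `topics` sums to 1.
    vocab = ["senate", "bill", "vote", "account", "savings", "retirement"]
    topics = np.array([
        [0.35, 0.30, 0.25, 0.05, 0.03, 0.02],   # a "politics" topic
        [0.02, 0.03, 0.05, 0.30, 0.30, 0.30],   # a "personal finance" topic
    ])

    def generate_document(topic_proportions, num_words):
        """Sample a bag-of-words document from the topic model."""
        # Document-specific word distribution: a mixture of the topic rows.
        word_dist = topic_proportions @ topics
        return list(rng.choice(vocab, size=num_words, p=word_dist))

    # A document that is 0.7 politics and 0.3 personal finance, as in the text.
    print(generate_document(np.array([0.7, 0.3]), num_words=10))

Learning runs in the opposite direction: given only a large collection of documents sampled this way, recover the matrix of topics (and, implicitly, the distribution over topic proportions) without ever observing them.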
This generative model is a simplistic account of how
documents are created. Nevertheless, for a wide range of
applications in text analysis, methods based on this model
do indeed recover meaningful topics. Figure 1 shows a randomly chosen set of topics recovered by our algorithm when run on a collection of New York Times articles. These tools have also found many applications in summarization and exploratory data analysis. In fact, the models described above are not limited to text analysis and have been used to recover semantic structure in various biological datasets, including fMRI images of brain activity.
Variants of this model have also been used in linguistic and humanities applications; see Blei5 for a thorough survey. (This work originally appeared as "A Practical Algorithm for Topic Modeling with Provable Guarantees" by Arora, Ge, Halpern, Mimno, Moitra, Sontag, Wu, and Zhu, ICML 2013.)
Traditional methods for learning the parameters of a
topic model are based on maximizing a likelihood objective.
Such approaches are popular when learning the parameters
of various other probabilistic models, too. However, even in the case of topic models with just two topics, this optimization problem is NP-hard.4 At best, the approaches used in practice are known to converge to the true solution eventually, but we know of no good guarantees on the running time needed to fit the parameters up to some desired precision. These gaps in our understanding are not only a theoretical issue but also a practical one: the seemingly large running times of these algorithms mean that learning 1000 topics from 20 million news articles requires a distributed algorithm and 100 dedicated computers.
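To make the objective concrete, the following sketch writes out the bag-of-words log-likelihood, treating each document's topic proportions as explicit parameters for simplicity (full topic models such as LDA instead integrate them out against a prior). The variable names are ours, and this shows only the objective being maximized, not a fitting procedure; jointly maximizing it over the topic matrix and the proportions is the non-convex problem discussed above.

    import numpy as np

    def log_likelihood(doc_word_counts, topics, proportions):
        """Bag-of-words log-likelihood.

        doc_word_counts: D x V matrix of word counts, one row per document.
        topics:          K x V matrix of topic-word distributions (rows sum to 1).
        proportions:     D x K matrix of per-document topic proportions (rows sum to 1).
        """
        # Each document's word distribution is a mixture of the topic rows.
        word_dists = proportions @ topics                      # D x V
        return float((doc_word_counts * np.log(word_dists + 1e-12)).sum())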
Recently, several groups of researchers have designed
new algorithms that have provable guarantees. These algorithms run in time that scales as a fixed polynomial in the number of documents and the inverse of the desired precision.2, 4 Our primary focus is on the algorithm of Arora et al., which is based on a seemingly realistic assumption—termed separability—about the structure of topics. The subsequent work of Anandkumar et al.2 removes this assumption, but requires that the topics be essentially uncorrelated and seems to be quite sensitive to violations of that requirement.
The contribution of the present article is to show that some
of these new theoretical algorithms can be adapted to yield
highly practical tools for topic modeling that compete with state-of-the-art approximate likelihood approaches in terms of solution quality and run in a fraction of the time. At
the same time, the provable guarantees continue to hold for
our simplified algorithms.
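To make the separability assumption concrete: it requires that every topic have at least one anchor word, a word with appreciable probability in that topic and (essentially) zero probability in every other topic. The sketch below, with a made-up toy topic matrix, merely checks this property on a known matrix; it is not the learning algorithm, which has to find anchor words from document data alone.

    import numpy as np

    def find_anchor_words(topics, tol=1e-12):
        """For each topic, list the words that occur (essentially) only in that topic.

        `topics` is a K x V matrix whose rows are topic-word distributions.
        The model is separable when every topic has at least one such anchor word.
        """
        anchors = []
        for k in range(topics.shape[0]):
            other_mass = np.delete(topics, k, axis=0).sum(axis=0)  # mass in all other topics
            anchors.append([w for w in range(topics.shape[1])
                            if topics[k, w] > tol and other_mass[w] <= tol])
        return anchors

    # Hypothetical toy matrix: word 0 anchors topic 0, word 3 anchors topic 1.
    toy = np.array([[0.5, 0.3, 0.2, 0.0],
                    [0.0, 0.2, 0.3, 0.5]])
    print(find_anchor_words(toy))   # [[0], [3]]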
Figure 1. Examples of topics automatically extracted from a collection of New York Times articles. Each row contains words from one topic, in descending order by probability.

anthrax, official, mail, letter, worker, attack
president, clinton, white_house, bush, official, bill_clinton
father, family, elian, boy, court, miami
oil, prices, percent, million, market, united_states
microsoft, company, computer, system, window, software
government, election, mexico, political, vicente_fox, president
fight, mike_tyson, round, right, million, champion
right, law, president, george_bush, senate, john_ashcroft