Finally let A denote the V × K matrix whose columns are A1, A2, . . ., AK. Then it can be shown that Q = ARAᵀ, where R denotes the K × K matrix of topic-topic co-occurrence probabilities. Suppose for the moment that we could accurately estimate the entries of Q.
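As a quick sanity check on the identity Q = ARAᵀ, the following short numpy sketch builds a small synthetic A and R (the dimensions and random draws are our own illustrative choices, not taken from the paper) and confirms that the resulting Q is entry-wise non-negative with entries summing to 1:

import numpy as np

rng = np.random.default_rng(0)
V, K = 6, 2                             # illustrative sizes, not from the paper

# Columns of A are topic-word distributions, so each column sums to 1.
A = rng.dirichlet(np.ones(V), size=K).T          # V x K

# R holds topic-topic co-occurrence probabilities; its entries sum to 1.
R = rng.dirichlet(np.ones(K * K)).reshape(K, K)
R = (R + R.T) / 2                                # make it symmetric

Q = A @ R @ A.T                                  # V x V word co-occurrence
assert np.all(Q >= 0) and np.isclose(Q.sum(), 1.0)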
A naive attempt to apply the method of moments runs
into its own computational difficulties when one attempts
to solve the system of non-linear equations. In particular, we
are faced with a matrix decomposition problem where our
goal is to express Q as a product of entry-wise non-negative
matrices as above. This is closely related to the nonnegative
matrix factorization problem and is known to be NP-hard.
The approach of Arora et al.4 is to make use of algorithms
that solve nonnegative matrix factorization under a certain
assumption that seems natural in the context of topic modeling. We describe this assumption and its rationale next.
1.4. Separability
The guiding assumption behind the algorithm of Arora et al.4 is a notion called separability.11 More precisely, this
assumption stipulates that topics can be reliably distinguished from one another via anchor words—which, in the
context of topic models, are specialized words that are specific to a single topic. For example, if the word 401k occurs in
a document then it is a strong indicator that the document
is at least partially about personal finance. Natural language
seems to contain many such unambiguous words. The condition of separability requires that each topic contains at
least one (unknown) anchor word. We provide empirical evidence showing that models fit to real-world data sets contain many anchor words.
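To make the separability condition concrete, here is a minimal Python sketch (the toy matrix and the helper find_anchor_candidates are our own illustration, not part of the paper) that flags words whose probability mass is concentrated on a single topic:

import numpy as np

def find_anchor_candidates(A, tol=0.0):
    """Return (word, topic) pairs where the word is an anchor candidate.

    A word is an anchor for topic k if its probability is positive under
    topic k and (near) zero under every other topic.
    """
    anchors = []
    for i, row in enumerate(A):
        support = np.flatnonzero(row > tol)
        if len(support) == 1:
            anchors.append((i, int(support[0])))
    return anchors

# Toy separable topic matrix: rows index words, columns index topics.
A = np.array([
    [0.4, 0.0],   # word 0 anchors topic 0 (e.g., "401k" for finance)
    [0.0, 0.5],   # word 1 anchors topic 1
    [0.3, 0.2],   # ambiguous word, appears in both topics
    [0.3, 0.3],   # ambiguous word
])
print(find_anchor_candidates(A))   # [(0, 0), (1, 1)]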
Arora et al.3 gave an algorithm for solving nonnegative
matrix factorization under the separability assumption. In a
subsequent paper, Arora et al.4 showed that such an algorithm
can be used to provably learn the parameters of a separable
topic model. While theoretically important, these algorithms
(as stated) were far from practical: the runtime is a large polynomial, and the algorithm itself is sensitive to violations of
the modeling assumptions, learning poor quality topics
when run on real-world data collections. The current paper
addresses these issues, presenting a variant of the above
algorithm that achieves state-of-the-art performance and runs orders of magnitude faster than approximate likelihood-based
approaches. Along the way, we also give a faster algorithm for
solving separable nonnegative matrix factorization.
We remark that separability is not the only assumption
that allows for polynomial-time recovery of topic models.
Anandkumar et al.2 give a provable algorithm for topic modeling based on third-order moments and tensor decomposition that does not require separability, but instead requires
that topics are essentially uncorrelated. Although standard
topic models like LDA7 assume this property, there is strong
evidence that real-world topics are dependent.6,19 For example, the topics economics and politics are more likely to co-occur than economics and cooking.
2. THE ANCHOR WORDS ALGORITHM
2.1. From probability to geometry
Separable topic models have various important probabilistic and geometric properties. These properties will form
the foundation for our algorithm. We will work with simple
statistics measuring how often various pairs of words co-occur in a document. Recall that the matrix Q denotes the co-occurrence probabilities of pairs of words. In this section it is more convenient to consider the conditional probabilities Q̄, where Q̄i,j is the probability of the second word being j conditioned on the first word being i. The matrix Q̄ is just a row-normalized version of Q whose rows sum up to 1.
It is useful to consider this data geometrically. We can view the rows of Q̄ as points in V-dimensional space. Moreover, we will call a row of Q̄ an anchor row if it corresponds to an anchor word. A simplified illustration of anchor and non-anchor rows is given in Figure 3. The key insight behind our algorithm is the following fact. Recall that a vector u is said to be in the convex hull of vectors v1, v2, . . ., vd if it can be written as u = Σi λivi where the λi's are nonnegative and sum to one.
Lemma 1. If the topic matrix is separable, then each row of Q̄ is in the convex hull of the anchor rows.
This geometric property motivates our simple, greedy algorithm for identifying the anchor words. First we sketch a proof of this lemma through elementary manipulations on various conditional probabilities.
Figure 3. The rows of Q̄ are vectors in V dimensions, and their convex hull is a K-simplex whose vertices are the anchor rows (in the illustration, anchor words are distinguished from regular words). Here a non-anchor row Q̄i is a convex combination of two anchor rows with weights λk and λk′, and this implies that the posterior distribution P(z1 = · | w1 = i) assigns λk to z1 = k, λk′ to z1 = k′, and zero to every other topic.
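Lemma 1 can be checked numerically on a toy separable model. The following sketch (our own illustration with made-up numbers, not from the paper) forms Q = ARAᵀ, row-normalizes it to obtain Q̄, and confirms that every row of Q̄ is a convex combination of the two anchor rows:

import numpy as np

# Toy separable model: words 0 and 1 are anchors for topics 0 and 1.
A = np.array([[0.4, 0.0],
              [0.0, 0.5],
              [0.3, 0.2],
              [0.3, 0.3]])             # V x K topic matrix (columns sum to 1)
R = np.array([[0.4, 0.1],
              [0.1, 0.4]])             # K x K topic co-occurrence (sums to 1)
Q = A @ R @ A.T                        # V x V word co-occurrence
Q_bar = Q / Q.sum(axis=1, keepdims=True)    # rows of Q_bar sum to 1

anchor_rows = Q_bar[[0, 1]]            # the two anchor rows
for row in Q_bar:
    # Solve for mixing weights in least squares; Lemma 1 predicts they
    # are non-negative and sum to one.
    lam, *_ = np.linalg.lstsq(anchor_rows.T, row, rcond=None)
    assert np.all(lam >= -1e-9) and np.isclose(lam.sum(), 1.0)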
Algorithm 1. FindAnchors
1: Compute co-occurrences. Let Nd be the length of document d, and Nd(i) be the number of occurrences of word i in document d; form the empirical co-occurrence matrix Q̂ from these counts.
2: Let Q̄ be a row normalization of Q̂, so that the rows of Q̄ sum up to 1.
3: for k = 1 to K do
4: Choose the row of Q̄ furthest from the affine span of the anchor rows chosen so far.
5: end for
6: Return the chosen anchor words.
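A direct, unoptimized Python rendering of steps 2-6 may help make the greedy selection concrete. This is our own sketch, assuming Q̂ has already been computed, not the paper's implementation; it measures distance from the affine span by translating the first anchor to the origin and then projecting out each chosen direction in turn:

import numpy as np

def find_anchors(Q_hat, K):
    """Greedy anchor-word selection: a sketch of Algorithm 1, steps 2-6."""
    # Step 2: row-normalize so each row of Q_bar sums to 1.
    Q_bar = Q_hat / Q_hat.sum(axis=1, keepdims=True)
    M = Q_bar.copy()

    # First anchor: the row furthest from the origin (one common convention
    # for the "affine span" of an empty set of chosen rows).
    anchors = [int(np.argmax(np.linalg.norm(M, axis=1)))]
    # Translate so the first anchor sits at the origin; distances to the
    # affine span of the chosen rows become distances to a linear span.
    M = M - M[anchors[0]]

    # Steps 3-5: repeatedly add the row furthest from the span so far.
    for _ in range(1, K):
        idx = int(np.argmax(np.linalg.norm(M, axis=1)))
        anchors.append(idx)
        u = M[idx] / np.linalg.norm(M[idx])
        M = M - np.outer(M @ u, u)     # project out the new direction

    # Step 6: return the chosen anchor words (their row indices).
    return anchors

On the toy Q from the sketch above, find_anchors(Q, K=2) returns the two anchor words (indices 0 and 1, in some order), since the furthest-point step always lands on a vertex of the simplex.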