92 COMMUNICATIONS OF THE ACM | APRIL 2018 | VOL. 61 | NO. 4
tried several methods for estimating hyperparameters, but
the observed differences did not change the relative performance of algorithms. Gibbs sampling has worse coherence than the other algorithms, but produces more unique
words per topic. These patterns are consistent with semi-synthetic results for similarly sized corpora (details are in
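Coherence scores of this kind are commonly computed from co-document frequencies of a topic's top words. A minimal sketch of one such metric (the exact formula used in the experiments may differ in details; the helper name is hypothetical, and every top word is assumed to appear in at least one document):

```python
import numpy as np
from itertools import combinations

def coherence(top_words, doc_word):
    """Co-document-frequency coherence for one topic's top words.

    top_words: word indices, ordered from most to least probable
    doc_word:  binary (num_docs x vocab_size) word-presence matrix
    """
    score = 0.0
    for j, i in combinations(range(len(top_words)), 2):
        wi, wj = top_words[i], top_words[j]
        co = np.sum(doc_word[:, wi] * doc_word[:, wj])  # docs containing both words
        dj = np.sum(doc_word[:, wj])                    # docs containing the earlier word
        score += np.log((co + 1) / dj)                  # smoothed log ratio
    return score

# Toy corpus: 4 documents over a 3-word vocabulary.
docs = np.array([[1, 1, 0],
                 [1, 1, 1],
                 [0, 1, 1],
                 [1, 0, 0]])
score = coherence([0, 1], docs)
```

Scores are at most zero, and higher (closer to zero) means the topic's top words co-occur more often, which is the sense in which Gibbs sampling scores worse above.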
For each NY Times topic learned by RecoverL2, we
find the closest Gibbs topic by l1 distance. The closest,
median, and farthest topic pairs are shown in Table 1.
We observe that when there is a difference, recover-based
topics tend to have more specific words (Anaheim Angels
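The matching step can be sketched as follows (a hypothetical helper, assuming each model's topics are stored as rows of a topic-word probability matrix):

```python
import numpy as np

def closest_topics(recover, gibbs):
    """For each RecoverL2 topic, find the closest Gibbs topic by l1 distance.

    recover, gibbs: (num_topics x vocab_size) arrays whose rows are
    topic-word distributions (each row sums to 1).
    Returns (indices, distances); an l1 distance lies in [0, 2].
    """
    # Pairwise l1 distances via broadcasting: dists[i, j] = ||recover_i - gibbs_j||_1
    dists = np.abs(recover[:, None, :] - gibbs[None, :, :]).sum(axis=2)
    return dists.argmin(axis=1), dists.min(axis=1)

# Toy example: 3 topics over a 5-word vocabulary.
rng = np.random.default_rng(0)
R = rng.dirichlet(np.ones(5), size=3)
G = rng.dirichlet(np.ones(5), size=3)
idx, d = closest_topics(R, G)
```

Sorting the matched pairs by distance yields the closest, median, and farthest pairs of the kind shown in Table 1.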
5. CONCLUDING REMARKS
Here we have shown that algorithms based on the separa-
bility assumption are highly practical and produce topic
models of quality comparable to likelihood-based meth-
ods that use Gibbs sampling, while running in a fraction
of the time. Moreover, these algorithms are particularly
well-suited to parallel implementations, since each of the
major steps—with the exception of finding the anchor
words—can be trivially parallelized. Our algorithms inherit
provable guarantees from earlier approaches4 in the sense
that given samples from a topic model, the estimates provably converge to the true parameters at an inverse polynomial rate. However, an important question going forward
is to theoretically explain why these algorithms appear
to be (somewhat) robust to model misspecification. In
our experiments, we fit a topic model to real data, and the
resulting model is not separable, but merely close
to being separable. Nevertheless, our algorithms recover
high-quality topics in this setting too. Likelihood-based
methods are known to be well-behaved when the model
is misspecified, and ideally one should be able to design
provable algorithms that not only have good running time
and sample complexity, but can also tolerate a realistic
amount of noise.
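As a concrete illustration of the parallelism noted above, each word's recovery subproblem is independent of the others. A minimal sketch, with hypothetical helper names and a simple clip-and-renormalize in place of a proper constrained solver:

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def recover_row(q_row, anchors):
    """Express one word's (normalized) co-occurrence row as a convex
    combination of the anchor rows: unconstrained least squares,
    then clip to nonnegative and renormalize."""
    coeffs, *_ = np.linalg.lstsq(anchors.T, q_row, rcond=None)
    coeffs = np.clip(coeffs, 0.0, None)
    s = coeffs.sum()
    return coeffs / s if s > 0 else coeffs

def recover_all(Q, anchor_rows, workers=4):
    """Each word's subproblem is independent, so the loop over rows
    parallelizes trivially."""
    anchors = Q[anchor_rows]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        rows = list(pool.map(lambda i: recover_row(Q[i], anchors),
                             range(Q.shape[0])))
    return np.vstack(rows)
```

The same structure maps directly onto process- or machine-level parallelism, since no state is shared between words.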
Since its publication, our algorithms have been extended
in a number of directions. Roberts et al.
24 consider applications in the social sciences and find that using an anchor-based model as an initialization for a likelihood-based
algorithm reduces variability and improves model fit.
Nguyen et al.22 improve the topic recovery step by adding regularization to smooth the estimated topic-word distributions, resulting in improved interpretability. A number of authors have suggested new approaches to find anchor words. Ding et al.10 present a distributed algorithm that can be parallelized across multiple servers. Zhou et al.
find anchor words by projecting rows of the word co-occurrence matrix into the plane,
and selecting words that often appear as extreme points.
Lee and Mimno18 replace random projections with a single
heavy-tailed t-SNE projection that does not preserve pairwise l2 distances, but preserves local distances, allowing
points to be more spread out in the projected space. Ge and Zou12 relax the anchor word assumption to a subset-separability assumption that can hold even when anchor words appear not in a single topic, but in a combination of a few topics.
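A minimal sketch of the projection idea behind these extreme-point methods (hypothetical helper names; not the exact algorithm of any paper above): project the rows of the word co-occurrence statistics onto random planes and count how often each word lands on the 2-D convex hull.

```python
import numpy as np
from scipy.spatial import ConvexHull

def anchor_candidates(Q, num_projections=50, seed=0):
    """Count, over random 2-D projections, how often each word (a row
    of Q) appears as a vertex of the convex hull; frequent vertices
    are candidate anchor words."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(Q.shape[0], dtype=int)
    for _ in range(num_projections):
        plane = rng.standard_normal((Q.shape[1], 2))  # random 2-D projection
        hull = ConvexHull(Q @ plane)
        counts[hull.vertices] += 1
    return counts

# Toy example: 40 "words" with 12-dimensional co-occurrence rows.
rng = np.random.default_rng(1)
Q = rng.random((40, 12))
counts = anchor_candidates(Q, num_projections=10, seed=1)
```

The intuition is that anchor words are vertices of the simplex spanned by the data, and vertices remain extreme under linear projection, so they keep reappearing on the hull across many random planes.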
Other recent work17 established criteria necessary for the
anchor factorization. Enforcing these criteria on the input
matrix through an initial rectification step substantially
improved model robustness, especially for small numbers
More broadly, the anchor words themselves have also
proven to be a useful tool in summarizing the meaning of a
topic and distinguishing a topic from related topics. When
coupled with the right visualization and analytic tools, it may
be possible to design semi-supervised learning algorithms
where a domain expert helps choose the final set of anchors.
It is also possible that anchor words will find applications
beyond text analysis, and will enable efficient algorithms in
other domains much the same way this assumption has in
Table 1. Example topic pairs from NY Times (closest l1), anchor words in bold. The UCI NY Times corpus includes named-entity annotations, indicated by the zzz prefix. All 100 topics are shown in the

RecoverL2: run inning game hit season zzz_anaheim_angel
Gibbs:     run inning hit game ball pitch

RecoverL2: father family zzz_elian boy court zzz_miami
Gibbs:     zzz_cuba zzz_miami cuban zzz_elian boy protest

RecoverL2: file sport read internet email zzz_los_angeles
Gibbs:     web site com www mail zzz_internet
Figure 9. Held-out probability (per token) is similar for RecoverL2
and Gibbs sampling. RecoverL2 has better coherence, but fewer
unique terms in the top N = 20 words than Gibbs. (Up is better for all
1. Ahmed, A., Aly, M., Gonzalez, J., Narayanamurthy, S., Smola, A.J. Scalable inference in latent variable models. In WSDM '12: Proceedings of the fifth ACM international conference on Web search and data mining (New York, NY, USA, 2012), ACM,
2. Anandkumar, A., Foster, D., Hsu, D., Kakade, S., Liu, Y. Two SVDs suffice: Spectral decompositions for probabilistic topic modeling and latent Dirichlet allocation. In NIPS (2012).
3. Arora, S., Ge, R., Kannan, R., Moitra, A. Computing a nonnegative matrix factorization—Provably. In STOC