To compare a learned word-topic matrix Â with the true matrix A, we use bipartite matching to align topics and then evaluate the ℓ1 distance between each pair of matched topics. When true parameters are not available, a standard evaluation for topic models is to compute held-out probability, the probability of previously unseen documents under the learned model.
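The alignment step can be sketched in a few lines. This is a minimal illustration, not the paper's code: it brute-forces the matching over permutations, which is only practical for small K (a real implementation would use a Hungarian-algorithm solver such as SciPy's `linear_sum_assignment`), and the function name and list-of-rows layout are hypothetical.

```python
from itertools import permutations

def topic_l1_error(A_true, A_hat):
    """Align learned topics to true topics by bipartite matching
    (brute force over permutations), then return the mean l1
    distance between matched topic pairs. Matrices are V x K,
    one column per topic, given as lists of rows."""
    K = len(A_true[0])
    def l1(i, j):  # l1 distance between true topic i and learned topic j
        return sum(abs(row[i] - hat[j]) for row, hat in zip(A_true, A_hat))
    # Best total distance over all one-to-one topic alignments.
    best = min(sum(l1(i, p[i]) for i in range(K))
               for p in permutations(range(K)))
    return best / K
```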
Topic models are useful because they provide interpretable latent dimensions. We can evaluate the semantic quality of individual topics using a metric called coherence.21 This metric has been shown to correlate well
with human judgments of topic quality. If we perfectly
reconstruct topics, all the high-probability words in a
topic should co-occur frequently; otherwise, the model
may be mixing unrelated concepts. Given a set of words
W, coherence is

coherence(W) = Σ_{w1, w2 ∈ W} log((D(w1, w2) + ε) / D(w1)),
where D(w) and D(w1, w2) are the number of documents
with at least one instance of w, and of w1 and w2, respectively. We set ε = 0.01 to avoid taking the log of zero for
words that never co-occur. Coherence measures the quality
of individual topics, but does not measure redundancy, so
we measure inter-topic similarity. For each topic, we gather
the set of the N most probable words. We then count how
many of those words do not appear in any other topic’s set
of N most probable words. For these experiments we use
N = 20. Some overlap is expected due to semantic ambiguity, but a lower number of unique words indicates a less useful model.
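Both metrics are straightforward to compute from document-level word sets. The sketch below assumes each document is represented as a set of word types; `doc_sets`, `top_words`, and the function names are illustrative, not from the paper's code.

```python
import math
from itertools import combinations

def coherence(top_words, doc_sets, eps=0.01):
    """Coherence of one topic's top words: sum over word pairs of
    log((D(w1, w2) + eps) / D(w1)), where D counts documents
    containing all of the given words."""
    def D(*ws):
        return sum(1 for d in doc_sets if all(w in d for w in ws))
    return sum(math.log((D(w1, w2) + eps) / D(w1))
               for w1, w2 in combinations(top_words, 2))

def unique_word_counts(top_word_sets):
    """For each topic, count top words that appear in no other
    topic's top-word set (lower counts suggest redundant topics)."""
    counts = []
    for i, words in enumerate(top_word_sets):
        others = set().union(*(s for j, s in enumerate(top_word_sets) if j != i))
        counts.append(len(set(words) - others))
    return counts
```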
4.2. Efficiency
Both the Recover and RecoverL2 algorithms, in Python,
are faster than a heavily optimized Gibbs sampling implementation in Java. Figure 5 shows the time to train models
on synthetic corpora on a single machine. Gibbs sampling is
linear in the corpus size. RecoverL2 is also linear (ρ = 0.79), but only varies from 33 to 50 sec. Estimating Q is linear, but takes only 7 sec for the largest corpus. FindAnchors takes less than 6 sec for all corpora.
4.3. Semi-synthetic documents
The new algorithms achieve low ℓ1 reconstruction error on semi-synthetic documents, especially for larger corpora. Results
for semi-synthetic corpora drawn from topics trained on NY
Times articles are shown in Figure 6 (top) for corpus sizes
ranging from 50k to 2M synthetic documents.
We evaluate on both semi-synthetic data (synthetic documents drawn from learned topic distributions) and real documents. In addition, we measure the effect of different sources of error and model mismatch.
4.1. Methodology
We train models on two synthetic data sets to evaluate
performance when model assumptions are correct, and
on real documents to evaluate real-world performance.
To ensure that synthetic documents resemble the dimensionality and sparsity characteristics of real data, we
generate semi-synthetic corpora. For each real corpus,
we train a model using Gibbs sampling and then generate new documents using the parameters of that model
(these parameters are not guaranteed to be separable; we
found that about 80% of topics fitted by Gibbs sampling
had anchor words).
We use two real-world data sets, a large corpus of New York
Times articles (295k documents, vocabulary size 15k, mean
document length 298) and a small corpus of Neural
Information Processing Systems (NIPS) abstracts (1100
documents, vocabulary size 2500, mean length 68).
Vocabularies were pruned with document frequency cutoffs. We generate semi-synthetic corpora of various sizes
from models trained with K = 100 topics on NY Times and NIPS, with document lengths set to 300 and 70, respectively, and with document-topic distributions drawn from a Dirichlet with symmetric hyperparameter 0.03.
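The generation procedure described above (symmetric Dirichlet document-topic weights, fixed document length) can be sketched as follows; `semi_synthetic_corpus` and its argument names are hypothetical, and A is assumed to be a V x K word-topic matrix whose columns sum to one.

```python
import numpy as np

def semi_synthetic_corpus(A, n_docs, doc_len, alpha=0.03, seed=0):
    """Draw documents from a fitted topic model: per-document topic
    weights from a symmetric Dirichlet(alpha), then doc_len words
    from the resulting mixture over the vocabulary."""
    rng = np.random.default_rng(seed)
    V, K = A.shape
    docs = []
    for _ in range(n_docs):
        theta = rng.dirichlet(np.full(K, alpha))  # document-topic weights
        word_dist = A @ theta                     # mixture over the vocabulary
        docs.append(rng.choice(V, size=doc_len, p=word_dist))
    return docs
```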
For the first stage of the algorithm, anchor word recovery,
we use the FindAnchors algorithm in all cases. The original linear programming-based anchor word finding method
presented with Recover in Arora et al.4 is too slow to be comparable. For Gibbs sampling, we obtain the word-topic distributions by averaging over 10 saved states, each separated by
100 iterations, after 1000 burn-in iterations.
We use a variety of metrics to evaluate the learned models.
For the semi-synthetic corpora, we compute the reconstruction error between the true word-topic distributions and the learned distributions.
Figure 4. The first three steps of FindAnchors consist of finding a
starting point furthest from the origin, finding the furthest point from
the initial point, and finding the furthest point from the line defined
by the first two points.
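The greedy selection the caption describes can be sketched as below. This is a simplified illustration only: it omits the random projection and candidate pruning of the actual FindAnchors implementation, and it assumes `Q_bar` holds one normalized word co-occurrence row per vocabulary word.

```python
import numpy as np

def find_anchors(Q_bar, K):
    """Greedy farthest-point anchor selection, in the spirit of
    FindAnchors: the first anchor is the row farthest from the
    origin; each later anchor is the row farthest from the span of
    the anchors chosen so far (via repeated projection)."""
    points = Q_bar.astype(float).copy()
    anchors = [int(np.argmax((points ** 2).sum(axis=1)))]
    points = points - points[anchors[0]]  # translate first anchor to origin
    for _ in range(K - 1):
        # Farthest remaining point from the current affine span.
        j = int(np.argmax((points ** 2).sum(axis=1)))
        anchors.append(j)
        b = points[j] / np.linalg.norm(points[j])
        points = points - np.outer(points @ b, b)  # project out that direction
    return anchors
```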
Figure 5. Training time on synthetic NIPS documents.