of problems that machine learning
tackles. New computational answers to
these questions would be a significant
contribution to topic modeling.
Visualization and user interfaces.
Another promising future direction for
topic modeling is to develop new methods of interacting with and visualizing
topics and corpora. Topic models provide new exploratory structure in large
collections—how can we best exploit
that structure to aid in discovery and
exploration?
One problem is how to display the
topics. Typically, we display topics by
listing the most frequent words of each
(see Figure 2), but new ways of labeling the topics—by either choosing
different words or displaying the chosen words differently—may be more
effective. A further problem is how to
best display a document with a topic
model. At the document level, topic
models provide potentially useful
information about the structure of the
document. Combined with effective
topic labels, this structure could help
readers identify the most interesting
parts of the document. Moreover, the
hidden topic proportions implicitly
connect each document to the other
documents (by considering a distance
measure between topic proportions).
How can we best display these connections? What is an effective interface
to the whole corpus and its inferred
topic structure?
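As a rough illustration of the two displays discussed above, the sketch below lists the highest-probability words of each topic and ranks documents by a distance between their topic proportions. The text does not name a distance measure; the Hellinger distance is assumed here as one common choice for comparing probability distributions. The function names and the toy numbers are hypothetical and do not come from a fitted model.

```python
import math


def top_words(topic_word_probs, vocab, n=3):
    """Return the n highest-probability words for each topic."""
    labels = []
    for probs in topic_word_probs:
        ranked = sorted(range(len(vocab)), key=lambda i: probs[i], reverse=True)
        labels.append([vocab[i] for i in ranked[:n]])
    return labels


def hellinger(p, q):
    """Hellinger distance between two discrete distributions.

    Assumed measure (not specified in the text): 0 for identical
    distributions, 1 for distributions with disjoint support.
    """
    return math.sqrt(
        0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    )


def nearest_documents(doc_topic_props, query, k=2):
    """Indices of the k documents closest to `query` in topic-proportion space."""
    others = [j for j in range(len(doc_topic_props)) if j != query]
    others.sort(key=lambda j: hellinger(doc_topic_props[query], doc_topic_props[j]))
    return others[:k]


# Toy inputs, invented for illustration only.
vocab = ["gene", "dna", "neuron", "brain", "data", "model"]
topic_word_probs = [
    [0.40, 0.35, 0.05, 0.05, 0.10, 0.05],  # a genetics-like topic
    [0.05, 0.05, 0.40, 0.30, 0.05, 0.15],  # a neuroscience-like topic
]
doc_topic_props = [
    [0.9, 0.1],  # document 0: mostly topic 0
    [0.8, 0.2],  # document 1: mostly topic 0
    [0.1, 0.9],  # document 2: mostly topic 1
]

print(top_words(topic_word_probs, vocab))
print(nearest_documents(doc_topic_props, query=0, k=1))
```

With these toy numbers, document 0 is linked to document 1, since both put most of their mass on the first topic. An interface could use such a ranking to suggest related documents, although which distance and which word ranking serve readers best is exactly the open question raised above.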
These are user interface questions,
and they are essential to topic modeling. Topic modeling algorithms show
much promise for uncovering meaningful thematic structure in large collections of documents. But making
this structure useful requires careful
attention to information visualization
and the corresponding user interfaces.
Topic models for data discovery.
Topic models have been developed
with information engineering applications in mind. As statistical models,
however, topic models should be able
to tell us something, or help us form
a hypothesis, about the data. What
can we learn about the language (and
other data) based on the topic model
posterior? Some work in this area has
appeared in political science [19], bibliometrics [17], and psychology [32]. This kind
of research adapts topic models to measure an external variable of interest, a
difficult task for unsupervised learning
that must be carefully validated.
Summary
We have surveyed probabilistic topic
models, a suite of algorithms that
provide a statistical solution to the
problem of managing large archives
of documents. With recent scientific
advances in support of unsupervised
machine learning—flexible components for modeling, scalable algorithms for posterior inference, and
increased access to massive datasets—
topic models promise to be an important component for summarizing and
understanding our growing digitized
archive of information.
References
1. Asuncion, A., Welling, M., Smyth, P., Teh, Y. On
smoothing and inference for topic models. In
Uncertainty in Artificial Intelligence (2009).
2. Bart, E., Welling, M., Perona, P. Unsupervised
organization of image collections: Taxonomies and
beyond. Trans. Pattern Recognit. Mach. Intell. 33, 11
(2010), 2301–2315.
3. Blei, D., Griffiths, T., Jordan, M. The nested Chinese
restaurant process and Bayesian nonparametric
inference of topic hierarchies. J. ACM 57, 2 (2010), 1–30.
4. Blei, D., Jordan, M. Modeling annotated data. In
Proceedings of the 26th Annual International ACM
SIGIR Conference on Research and Development in
Information Retrieval (2003), ACM Press, 127–134.
5. Blei, D., Lafferty, J. Dynamic topic models. In
International Conference on Machine Learning (2006),
ACM, New York, NY, USA, 113–120.
6. Blei, D., Lafferty, J. A correlated topic model of
Science. Ann. Appl. Stat. 1, 1 (2007), 17–35.
7. Blei, D., McAuliffe, J. Supervised topic models. In
Neural Information Processing Systems (2007).
8. Blei, D., Ng, A., Jordan, M. Latent Dirichlet allocation.
J. Mach. Learn. Res. 3 (January 2003), 993–1022.
9. Box, G. Sampling and Bayes’ inference in scientific
modeling and robustness. J. Roy. Stat. Soc. 143, 4
(1980), 383–430.
10. Boyd-Graber, J., Blei, D. Syntactic topic models. In
Neural Information Processing Systems (2009).
11. Buntine, W. Variational extensions to EM and
multinomial PCA. In European Conference on Machine
Learning (2002).
12. Buntine, W., Jakulin, A. Discrete component analysis.
Subspace, Latent Structure and Feature Selection.
C. Saunders, M. Grobelnik, S. Gunn, and J. Shawe-Taylor,
eds. Springer, 2006.
13. Chang, J., Blei, D. Hierarchical relational models for
document networks. Ann. Appl. Stat. 4, 1 (2010).
14. Deerwester, S., Dumais, S., Landauer, T., Furnas, G.,
Harshman, R. Indexing by latent semantic analysis. J.
Am. Soc. Inform. Sci. 41, 6 (1990), 391–407.
15. Doyle, G., Elkan, C. Accounting for burstiness in topic
models. In International Conference on Machine
Learning (2009), ACM, 281–288.
16. Fei-Fei, L., Perona, P. A Bayesian hierarchical model for
learning natural scene categories. In IEEE Computer
Vision and Pattern Recognition (2005), 524–531.
17. Gerrish, S., Blei, D. A language-based approach
to measuring scholarly impact. In International
Conference on Machine Learning (2010).
18. Griffiths, T., Steyvers, M., Blei, D., Tenenbaum, J.
Integrating topics and syntax. Advances in Neural
Information Processing Systems 17. L. K. Saul, Y.
Weiss, and L. Bottou, eds. MIT Press, Cambridge, MA,
2005, 537–544.
19. Grimmer, J. A Bayesian hierarchical topic model for
political texts: Measuring expressed agendas in Senate
press releases. Polit. Anal. 18, 1 (2010), 1.
20. Hoffman, M., Blei, D., Bach, F. On-line learning for
latent Dirichlet allocation. In Neural Information
Processing Systems (2010).
21. Hofmann, T. Probabilistic latent semantic analysis.
In Uncertainty in Artificial Intelligence (UAI) (1999).
22. Jordan, M., Ghahramani, Z., Jaakkola, T., Saul, L.
Introduction to variational methods for graphical
models. Mach. Learn. 37 (1999), 183–233.
23. Li, J., Wang, C., Lim, Y., Blei, D., Fei-Fei, L. Building and
using a semantivisual image hierarchy. In Computer
Vision and Pattern Recognition (2010).
24. Li, W., McCallum, A. Pachinko allocation: DAG-structured
mixture models of topic correlations. In
International Conference on Machine Learning (2006),
577–584.
25. Mimno, D., McCallum, A. Topic models conditioned on
arbitrary features with Dirichlet-multinomial regression.
In Uncertainty in Artificial Intelligence (2008).
26. Newman, D., Chemudugunta, C., Smyth, P. Statistical
entity-topic models. In Knowledge Discovery and Data
Mining (2006).
27. Pritchard, J., Stephens, M., Donnelly, P. Inference of
population structure using multilocus genotype data.
Genetics 155 (June 2000), 945–959.
28. Reisinger, J., Waters, A., Silverthorn, B., Mooney, R.
Spherical topic models. In International Conference
on Machine Learning (2010).
29. Rosen-Zvi, M., Griffiths, T., Steyvers, M., Smyth, P.
The author-topic model for authors and documents. In
Proceedings of the 20th Conference on Uncertainty in
Artificial Intelligence (2004), AUAI Press, 487–494.
30. Rubin, D. Bayesianly justifiable and relevant frequency
calculations for the applied statistician. Ann. Stat. 12,
4 (1984), 1151–1172.
31. Sivic, J., Russell, B., Zisserman, A., Freeman, W., Efros, A.
Unsupervised discovery of visual object class hierarchies.
In Conference on Computer Vision and Pattern
Recognition (2008).
32. Socher, R., Gershman, S., Perotte, A., Sederberg, P., Blei,
D., Norman, K. A Bayesian analysis of dynamics in free
recall. In Advances in Neural Information Processing
Systems 22. Y. Bengio, D. Schuurmans, J. Lafferty, C. K.
I. Williams, and A. Culotta, eds. 2009.
33. Steyvers, M., Griffiths, T. Probabilistic topic models.
Latent Semantic Analysis: A Road to Meaning.
T. Landauer, D. McNamara, S. Dennis, and W. Kintsch,
eds. Lawrence Erlbaum, 2006.
34. Teh, Y., Jordan, M., Beal, M., Blei, D. Hierarchical
Dirichlet processes. J. Am. Stat. Assoc. 101, 476
(2006), 1566–1581.
35. Wainwright, M., Jordan, M. Graphical models,
exponential families, and variational inference. Found.
Trends Mach. Learn. 1, 1–2 (2008), 1–305.
36. Wallach, H. Topic modeling: Beyond bag of words. In
Proceedings of the 23rd International Conference on
Machine Learning (2006).
37. Wang, C., Blei, D. Decoupling sparsity and
smoothness in the discrete hierarchical Dirichlet
process. Advances in Neural Information Processing
Systems 22. Y. Bengio, D. Schuurmans, J. Lafferty, C.
K. I. Williams, and A. Culotta, eds. 2009, 1982–1989.
38. Wang, C., Thiesson, B., Meek, C., Blei, D. Markov topic
models. In Artificial Intelligence and Statistics (2009).
David M. Blei (blei@cs.princeton.edu) is an associate
professor in the computer science department of
Princeton University, Princeton, N.J.
© 2012 ACM 0001-0782/12/04 $10.00