Science | DOI: 10.1145/1859204.1859210
Gary Anthes
Topic Models vs. Unstructured Data
With topic modeling, scientists can explore and
understand huge collections of unlabeled information.
Topic modeling, an amalgam of ideas drawn from computer science, mathematics, and cognitive science, is evolving rapidly to help users understand and navigate huge stores of unstructured data.
Topic models use Bayesian statistics
and machine learning to discover the
thematic content of unlabeled documents, provide application-specific
roadmaps through them, and predict
the nature of future documents in a
collection. Most often used with text
documents, topic models can also be
applied to collections of images, music, DNA sequences, and other types of
information.
Because topic models can discover
the latent, or hidden, structure in documents and establish links between
documents, they offer a powerful new
way to explore and understand information that might otherwise seem chaotic and unnavigable.
The base on which most probabilistic topic models are built today
is latent Dirichlet allocation (LDA).
Applied to a collection of text documents, LDA discovers “topics,” which
are probability distributions over
words that co-occur frequently. For
example, “software,” “algorithm,”
and “kernel” might be found likely
to occur in articles about computer
science. LDA also discovers the probability distribution of topics in a document. For example, by examining the
word patterns and probabilities, one
article might be tagged as 100% about
computer science while another
might be tagged as 10% computer science and 90% neuroscience.
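To make those two distributions concrete, the sketch below (not part of the article; the tiny corpus, the choice of scikit-learn, and all parameters are illustrative assumptions) fits a two-topic LDA model and prints each topic’s most probable words along with each document’s topic proportions.

```python
# Illustrative sketch: LDA's two outputs, topic-word distributions and
# per-document topic proportions. Corpus and parameters are made up.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the kernel scheduler runs the algorithm in software",
    "neurons fire and synapses adapt in the cortex",
    "a learning algorithm models neurons in software",
]

# LDA operates on bag-of-words counts, not raw text.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

# n_components is the assumed number of topics in the collection.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(counts)   # per-document topic proportions

# Each row of components_ is an (unnormalized) distribution over the vocabulary.
vocab = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[::-1][:5]
    print(f"topic {k}:", ", ".join(vocab[i] for i in top))

# One document may come out mostly one topic, another a 10%/90% mixture.
for d, mix in enumerate(doc_topic):
    print(f"doc {d} topic proportions:", mix.round(2))
```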
LDA algorithms are built on assumptions of how a “generative” process might create a collection of documents from these probability distributions. The process does that by first assigning to each document a probability distribution across a small number of topics from among, say, 100 possible topics in the collection. Then, for each of these hypothetical documents, a topic is chosen at random (but weighted by its probability distribution), and a word is generated at random from that topic’s probability distribution across the words. This hypothetical process is repeated over and over, each word in a document occurring in proportion to the distribution of topics in the document and the distribution of words in a topic, until all the documents have been generated.
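The generative story can also be made concrete with a short simulation. The sketch below is purely illustrative (the vocabulary, topic count, and Dirichlet hyperparameters are made-up assumptions, not drawn from the article): it draws per-document topic proportions from a Dirichlet distribution, then repeatedly picks a topic for each word slot and a word from that topic’s distribution.

```python
# Toy simulation of the generative process described above.
# All values here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

vocab = ["software", "algorithm", "kernel", "neuron", "synapse", "cortex"]
K, D, N = 2, 3, 8        # topics, documents, words per document
alpha, eta = 0.5, 0.5    # Dirichlet hyperparameters

# Each topic is a probability distribution over the vocabulary.
topics = rng.dirichlet([eta] * len(vocab), size=K)

for d in range(D):
    # Each document gets its own distribution over topics.
    theta = rng.dirichlet([alpha] * K)
    words = []
    for _ in range(N):
        z = rng.choice(K, p=theta)               # pick a topic for this word
        w = rng.choice(len(vocab), p=topics[z])  # pick a word from that topic
        words.append(vocab[w])
    print(f"doc {d} (topic mix {theta.round(2)}):", " ".join(words))
```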
[Figure: Latent Dirichlet allocation. Plate diagram showing the Dirichlet parameter (α), per-document topic proportions (θd), per-word topic assignments (Zd,n), observed words (Wd,n), topics (βk), and the topic hyperparameter (η), replicated over N words per document, D documents, and K topics.]
More Modular, More Scalable
LDA is essentially a technical refinement—making it more modular and
scalable—of the topic modeling technique called probabilistic latent semantic indexing. Introduced in 1999
by Jan Puzicha and Thomas Hofmann,
probabilistic latent semantic indexing was derived from Latent Semantic
Indexing, which was developed in the
late 1980s by Scott Deerwester, Susan
T. Dumais, George W. Furnas, Thomas