markov meets Bayes
By Fernando Pereira
ProBaBiLis TiC seqUenCe ModeLs have become so pervasive in science and engineering that we often act as if the
sequence modeling problem is basically solved. On the Internet, such
models power the large-scale systems
for statistical machine translation,
such as Google Translate, that receive
millions of queries each day. These
systems use a probabilistic sequence
model—a language model—to select
among candidate translations in the
target language. For example, consider translating the English phrases
“to show interest” and “to charge interest” into Portuguese. In the first
phrase, the word “interest” should be
translated as “interesse” (curiosity)
while in the second it should be translated as “juro” (return on lending).
In each case, a Portuguese language
model would assign higher probability to the correct translation based
on the context established by the preceding verb. Today, such language
models are derived from the statistics
of large corpora with as many as 1010
sentences of text. Moreover, in practical systems for machine translation or
speech recognition, they may assign
probabilities to sequences that contain as many as 106 distinct words.
The history of probabilistic se-
quence models, too long and complex
to review here, dates back to Markov
at the turn of the last century. To a
first approximation, today’s practice
is to memorize a subset of cw pat-
terns occurring in the training data,
where w is a token (for example, Uni-
code point, DNA base, word) and c is
a context of preceding tokens. The
statistics of these patterns are then
used to estimate the probability of
a token appearing after a given se-
quence of tokens. For modelers, the
critical modeling questions are how
to select which patterns to store, and
from these patterns, how to estimate
the probability that a token w follows
a context c when w has never or rarely
been seen in that context. Though in-
formed by decades of research, cur-
rent practices are still something of
a black art. They often reflect the par-
ticular data structures and algorithms
used to create sequence models and,
for large-scale systems, the need for
distribution implementations. We
practitioners seem for the most part
resigned to these complications, al-
though naturally we wonder if there
are more elegant ways to manage the
balance between memorizing old pat-
terns and generalizing to new ones.
1. Chelba, C. A structured language model. In
Proceedings of the 35th Annual Meeting of the ACL
(Morristown, NJ, 1997), Assoc. Computational
2. Cleary, J.G., and Teahan, W.J. Unbounded length
contexts for PPM. The Computer J. 40 (1997), 67–75.
3. Lin, D. An information-theoretic definition of similarity.
In Proceedings of the 15th International Conference
on Machine Learning (1998), 296–304; http:dblp.uni-trier.de/.
4. Mochihashi, D., and Sumita, E. The infinite Markov
model. Advances in Neural Information Processing
Systems 20 (2008), 1017–1024.
5. Willems, F., Shtarkov, Y., and Tjalkens, T. The context-tree weighting method: Basic properties. IEEE Trans.
Info. Theory 41, 3 (May 1995), 653–664.
Fernando Pereira is research director at Google, Menlo
© 2011 ACM 0001-0782/11/0200 $10.00