DOI: 10.1145/1897816.1897842
The Sequence Memoizer
By Frank Wood, Jan Gasthaus, Cédric Archambeau, Lancelot James, and Yee Whye Teh
Abstract
Probabilistic models of sequences play a central role in
most machine translation, automated speech recognition,
lossless compression, spell-checking, and gene identification applications, to name but a few. Unfortunately, real-world sequence data often exhibit long-range dependencies
that can only be captured by computationally challenging, complex models. Sequence data arising from natural
processes also often exhibit power-law properties, yet common sequence models do not capture such properties. The
sequence memoizer is a new hierarchical Bayesian model
for discrete sequence data that captures long-range dependencies and power-law characteristics while remaining
computationally attractive. Its utility as a language model
and general-purpose lossless compressor is demonstrated.
1. INTRODUCTION
It is an age-old quest to predict what comes next in
sequences. Fortunes have been made and lost on the success and failure of such predictions. Heads or tails? Will
the stock market go up by 5% tomorrow? Is the next card
drawn from the deck going to be an ace? Does a particular
sequence of nucleotides appear more often than usual in a
DNA sequence? In a sentence, is the word that follows the
United going to be States, Arab, Parcel, Kingdom, or something else? Using a probabilistic model of sequences fit to a
particular set of data is usually an effective way of answering
these kinds of questions.
Consider the general task of sequence prediction. For
some sequences, the true probability distribution of the
next symbol does not depend on the previous symbols in
the sequence. For instance, consider flipping a coin that
comes up heads with probability p or tails with probability
1 − p every time it is flipped. Subsequent flips of such a coin
are completely independent and have the same distribution
(in statistics, such coin flips are called independent and identically distributed, or i.i.d.). In particular, heads will occur with
probability p irrespective of whether previous flips have
come up heads or tails. Assuming we observe a sequence of
such coin flips, all we need to do is to estimate p in order to
fully characterize the process that generated the data.
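To make this concrete, the maximum-likelihood estimate of p is simply the fraction of observed flips that came up heads. The short Python sketch below, which is purely illustrative and not part of the sequence memoizer itself, simulates such an i.i.d. coin and recovers p from the observed data:

```python
import random

def estimate_p(flips):
    """Maximum-likelihood estimate of P(heads) for an i.i.d. coin:
    the fraction of observed flips that came up heads."""
    return sum(flips) / len(flips)

# Illustrative simulation: T i.i.d. flips of a coin whose true
# P(heads) is 0.7, followed by estimation of p from the data alone.
random.seed(0)
true_p = 0.7                      # assumed for the simulation only
flips = [1 if random.random() < true_p else 0 for _ in range(10_000)]
print(estimate_p(flips))          # prints a value close to 0.7
```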
For more interesting processes, the distribution of the
next symbol often depends in some complex way on previous
outcomes. One example of such a process is natural language
(sequences of words). In English, the distribution of words
that follow the single-word context United is quite different
from the distribution of words that follow Cricket and rugby
are amongst the most popular sports in the United. In the first
case, the distribution is relatively broad (though not nearly as
broad as the distribution given no context at all), giving significant probability to words such as States, Kingdom, Airlines,
and so forth, whereas in the second case, the distribution is
almost certainly highly peaked around Kingdom. Information
from distant context (Cricket and rugby) impacts the distribution of the next word profoundly. Production of natural
language is but one example of such a process; the real world
is replete with other examples.
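One way to see the effect of context on prediction is with a simple count-based model over a toy corpus: conditioning on the single word united leaves the predictive distribution broad, while conditioning on a longer context such as in the united concentrates it sharply. The sketch below is only an illustrative counting scheme over a made-up corpus, not the sequence memoizer's model:

```python
from collections import Counter, defaultdict

def context_distributions(words, context_length):
    """Empirical distribution of the next word given the preceding
    `context_length` words, estimated by simple counting."""
    counts = defaultdict(Counter)
    for i in range(context_length, len(words)):
        context = tuple(words[i - context_length:i])
        counts[context][words[i]] += 1
    return {ctx: {w: c / sum(ctr.values()) for w, c in ctr.items()}
            for ctx, ctr in counts.items()}

# Toy corpus (made up for illustration): after the single-word context
# "united" several continuations are plausible, but after the longer
# context "in the united" the distribution collapses onto "kingdom".
corpus = ("flights on united airlines to the united states and the "
          "united arab emirates . cricket and rugby are amongst the "
          "most popular sports in the united kingdom").split()
print(context_distributions(corpus, 1)[("united",)])
print(context_distributions(corpus, 3)[("in", "the", "united")])
```

Raw counts like these become hopelessly sparse as contexts grow longer, which is one of the difficulties the hierarchical Bayesian approach described in this article is designed to address.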
2. PREDICTING SEQUENCES
To start, let S be the set of symbols that can occur in some
sequence. This set can consist of dictionary entries, ASCII
values, or {A, C, G, T} in the case of DNA sequences. Suppose
that we are given a sequence x = x_1, x_2, …, x_T of symbols
The work reported in this paper originally appeared
in “A Hierarchical Bayesian Language Model based on
Pitman–Yor Processes,” published in the Proceedings
of the International Conference on Computational Linguistics
and the Association for Computational Linguistics, 2006;
“A Stochastic Memoizer for Sequence Data,” published in
the Proceedings of the International Conference on Machine
Learning, 2009; and “Lossless Compression Based on the
Sequence Memoizer,” published in the Proceedings of the
IEEE Data Compression Conference, 2010.