in the compression literature. 4, 19
5.3. Inference and prediction
As a consequence of the two marginalization steps described
in the previous subsection, inference in the full SM model
with an infinite number of parameters is equivalent to inference in the compact context tree T̂(x) with a linear number of parameters. Further, the prior over the conditional distributions on T̂(x) still retains the form of a hierarchical PYP:
each node still has a PYP prior with its parent as the base
distribution. This means that inference algorithms developed for the finite-order hierarchical PYP model can be easily adapted to the SM. We will briefly describe the inference
algorithms we employ.
In the SM model we are mainly interested in the predictive distribution of the next symbol being some s ∈ Σ given some context u, conditioned on an observed sequence x. As in Equation 2, this predictive distribution is expressed as an expectation E[G_u(s)] over the posterior distribution of {G_u′} for u′ ∈ T̂(x). Just as in Equation 3, it is possible to express E[G_u(s)] as an expectation over a set of random counts {N(u′s′), M(u′s′)} for u′ ∈ T̂(x), s′ ∈ Σ:

$$\mathrm{E}[G_{\mathbf{u}}(s)] = \mathrm{E}\!\left[\frac{N(\mathbf{u}s) - \alpha_{\mathbf{u}}M(\mathbf{u}s) + \alpha_{\mathbf{u}}M(\mathbf{u}\cdot)\,G_{\sigma(\mathbf{u})}(s)}{N(\mathbf{u}\cdot)}\right] \qquad (5)$$
Again, the first term in the numerator can be interpreted as a count of the number of times s occurs in the context u, the second term is the reduction applied to the count, while the third term spreads the total reduction across Σ according to the base distribution G_σ(u)(s). Each context u now has its own discount parameter α_u, which is the product of discounts on the non-branching chain leading to u on T(x), while the parent σ(u) is the head of the chain. Notice that Equation 5 is defined recursively, with the predictive distribution G_u in context u being a function of the same in the parent σ(u), and so on up the tree.
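To make the recursion concrete, here is a minimal sketch (not the authors' implementation) of how the predictive probability in Equation 5 could be evaluated, assuming hypothetical Node objects that already hold one posterior sample of the counts N and M, the per-node discount α_u, and a pointer to the parent σ(u). In practice the expectation in Equation 5 would be approximated by averaging this quantity over many such samples.

```python
class Node:
    """Hypothetical node of the compact context tree T^(x) (one posterior sample)."""
    def __init__(self, alpha, parent=None):
        self.alpha = alpha      # alpha_u: product of discounts on the collapsed chain leading to u
        self.parent = parent    # sigma(u): head of that chain (None above the root)
        self.N = {}             # N(us): number of observations of symbol s in context u
        self.M = {}             # M(us): number of counts passed up to the parent for symbol s

def predictive(node, s, base_prob):
    """Evaluate the recursion of Equation 5: P(next symbol = s | context u).

    base_prob is the probability of s under the overall base distribution
    (e.g. uniform over the alphabet), used once the recursion passes the root.
    """
    if node is None:
        return base_prob
    n_total = sum(node.N.values())
    if n_total == 0:
        # Nothing observed in this context: defer entirely to the parent.
        return predictive(node.parent, s, base_prob)
    m_total = sum(node.M.values())
    parent_prob = predictive(node.parent, s, base_prob)
    # count, minus the discount-based reduction, plus the reduction redistributed
    # across the alphabet via the parent's predictive distribution
    return (node.N.get(s, 0) - node.alpha * node.M.get(s, 0)
            + node.alpha * m_total * parent_prob) / n_total
```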
6. Demonstration
We now consider two target applications: language modeling and data compression. It is demonstrated that the
SM model is able to achieve better performance than
most state-of-the-art techniques by capturing long-range
dependencies.
6.1. Language modeling
Language modeling is the task of fitting probabilistic models
to sentences (sequences of words), which can then be used
to judge the plausibility of new sequences being sentences in
the language. For instance, “God save the Queen” should be
given a higher score than “Queen the God save” and certainly
more than “glad slave the spleen” under any model of English.
Language models are mainly used as building blocks in natural
language processing applications such as statistical machine
translation and automatic speech recognition. In the former,
for example, a translation algorithm might propose multiple
sequences of English words, at least one of which is hoped to
correspond to a good translation of a foreign language source.
Usually only one or a few of these suggested sequences are
plausible English language constructions. The role of the language model is to judge which English construction is best.
Better language models generally lead to better translators.
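As a rough illustration of this reranking role, the sketch below picks the candidate sentence a language model scores as most plausible. The rerank helper and sentence_log_prob argument are hypothetical names, and the scores are made up for illustration; a real model would sum the log-probabilities of each word given its context, as in Equation 5.

```python
def rerank(candidates, sentence_log_prob):
    """Return the candidate the language model scores as most plausible.

    candidates:        word sequences proposed by, e.g., a translation system.
    sentence_log_prob: hypothetical callable returning log P(sentence) under the model.
    """
    return max(candidates, key=sentence_log_prob)

# Toy stand-in scores, purely for illustration.
toy_scores = {
    ("God", "save", "the", "Queen"): -10.0,
    ("Queen", "the", "God", "save"): -25.0,
}
best = rerank(list(toy_scores), lambda sentence: toy_scores[sentence])
print(" ".join(best))  # -> God save the Queen
```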
Language model performance is reported in terms of
a standard measure called perplexity. This is defined as 2^ℓ(x), where ℓ(x) = −(1/|x|) Σ_i log2 P(x_i | x_{1:i−1}) is the average log-loss on a sequence x and the average number of bits per word required to encode the sequence using an optimal code.
Another interpretation of perplexity is that it is the average
number of guesses the model would have to make before it
guessed each word correctly (if it makes these guesses by drawing samples from its estimate of the conditional distribution). For the SM model, P(x_i | x_{1:i−1}) is computed as in Equation 5. Both
lower log-loss and lower perplexity are better.
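As a concrete check of the definition, the following sketch computes perplexity from per-word predictive probabilities P(x_i | x_{1:i−1}); for the SM model these would come from Equation 5. The function name and example numbers are illustrative only.

```python
import math

def perplexity(word_probs):
    """2 ** (average log-loss), where the log-loss of word i is -log2 P(x_i | x_{1:i-1}).

    Lower perplexity means the model finds the held-out text less surprising.
    """
    avg_log_loss = -sum(math.log2(p) for p in word_probs) / len(word_probs)
    return 2.0 ** avg_log_loss

# A model that assigns probability 1/8 to every word needs 3 bits per word,
# so its perplexity is 8: on average, 8 "guesses" per word.
print(perplexity([0.125] * 10))  # -> 8.0
```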
Figure 3 compares the SM model against nth order Markov models with hierarchical PYP priors as the order n varies.
Figure 3. In blue is the performance of the SM model (dashed line) versus nth order Markov models with hierarchical PYP priors (solid line) as n varies (test data perplexity, lower is better). In red is the computational complexity of the SM model (dashed line) versus the Markov models (solid line) in terms of the number of nodes in the context tree/trie. For this four million word New York Times corpus, as n passes 4, the memory complexity of the Markov models grows larger than that of the SM, yet the SM model yields modeling performance that is better than all Markov models regardless of their order. This suggests that for n ≥ 4 the SM model is to be preferred: it requires less space to store yet results in a comparable if not better model.
[Plot: perplexity (left axis) and number of nodes (right axis) versus context length n, with curves for Markov perplexity, SM perplexity, Markov #nodes, and SM #nodes.]
b Note that an nth order Markov model is an m-gram model where m = n + 1.