models with hierarchical PYP priors, for various values of
n, on a four million word New York Times corpus. Table 1
compares the hierarchical PYP Markov model and the SM
model against other state-of-the-art models, on a 14 million
word Associated Press news article corpus. The AP corpus is a
benchmark corpus for which the performance of many models is available. Note that it was preprocessed by Bengio et al.2 to remove low-frequency words. This works to the detriment of the SM, which is explicitly designed to improve the modeling of infrequent words through its power-law scaling. The AP corpus is also relatively small, which limits the benefit the SM model gains from capturing longer-range dependencies.
Our results show that the SM model is a competitive language
model. However, perplexity results alone do not tell the whole
story. As more data is used to estimate a language model, its performance typically improves. This means that computational
considerations such as memory and runtime must enter into
the discussion about what constitutes a good language model.
In many applications, fast prediction is imperative; in others, particularly in online settings, incorporation of new data
into the model must be fast. In comparison to more complex
language models, prediction in the SM has real-world time
complexity that is essentially the same as that of a smoothed
finite-order Markov model, while its memory complexity is
linear in the amount of data. The computational complexity of Markov models theoretically does not depend on the
amount of data but is exponential in the Markov order, rendering straightforward extensions to higher orders impractical. The SM model directly fixes this problem while remaining
computationally tractable. Constant-space, constant-time
extensions to the SM model1 have been developed, which show
great promise for language modeling and other applications.
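For reference, the perplexity figures reported in Table 1 are simply the exponentiated average per-word log-loss on test data. A minimal sketch of the computation, assuming a hypothetical model.prob(word, context) predictive interface (not the actual SM API), is:

import math

def perplexity(model, words):
    # Perplexity = 2 ** (average per-word log-loss in bits); lower is better.
    # model.prob(word, context) is a hypothetical predictive interface.
    total_bits, context = 0.0, []
    for word in words:
        total_bits -= math.log2(model.prob(word, context))  # code length of this word
        context.append(word)
    return 2 ** (total_bits / len(words))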
6.2. Compression
Shannon’s celebrated results in information theory17 have
led to lossless compression technology that, given a coding
distribution, nearly optimally achieves the theoretical lower
limit (given by the log-loss) on the number of bits needed to
encode a sequence. Lossless compression is closely related
to sequence modeling: an incrementally constructed probabilistic sequence model such as the SM can be used to
adaptively construct coding distributions, which can then be used directly for compression based on entropy coding.
Table 1. Language modeling performance for a number of models on an Associated Press news corpus (lower perplexity is better). Interpolated and modified Kneser–Ney are state-of-the-art language models. Along with hierarchical PYP and the Sequence Memoizer, these models do not model relationships among words in the vocabulary. Provided for comparison are the results for the models of Bengio et al. and Mnih et al., which belong to a different class of models that learn word representations from data.
Source                                  Perplexity
Bengio et al.2                          109.0
Mnih et al.13                           83.9
4-gram interpolated Kneser–Ney3,18      106.1
4-gram modified Kneser–Ney3,18          102.4
4-gram hierarchical PYP18               101.9
Sequence Memoizer21                     96.9
Table 2. Compression performance in terms of weighted average log-loss (average bits per byte under optimal entropy encoding; lower is better) for the Calgary corpus, a standard benchmark collection of diverse file types. The result for unbounded-length context PPM is from Cleary and Teahan.4 The result for CTW is from Willems.20 The bzip2 and gzip results come from running the corresponding standard Unix command-line tools with no extra arguments.
Model     Average log-loss (bits/byte)
SM        1.89
PPM       1.93
CTW       1.99
bzip2     2.11
gzip      2.61
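To make this coupling between model and coder concrete, the following simplified sketch pairs an adaptive model with a floating-point arithmetic coder. The toy add-one count model is only a stand-in for the SM, and a production coder would use integer arithmetic with renormalization rather than floats; this is an illustrative sketch, not the implementation used in our experiments.

import math
from collections import defaultdict

class AdaptiveModel:
    # Toy adaptive model (add-one-smoothed symbol counts); a stand-in for the SM.
    def __init__(self, alphabet):
        self.alphabet = list(alphabet)
        self.counts = defaultdict(lambda: 1)

    def distribution(self):
        total = sum(self.counts[s] for s in self.alphabet)
        return {s: self.counts[s] / total for s in self.alphabet}

    def update(self, symbol):
        self.counts[symbol] += 1

def encode(sequence, alphabet):
    # Narrow the interval [low, high); any number inside it, written with about
    # -log2(high - low) bits, identifies the sequence to a matching decoder.
    model = AdaptiveModel(alphabet)
    low, high = 0.0, 1.0
    for symbol in sequence:
        dist = model.distribution()          # coding distribution for this position
        width, cum = high - low, 0.0
        for s in model.alphabet:             # locate the symbol's sub-interval
            if s == symbol:
                low, high = low + width * cum, low + width * (cum + dist[s])
                break
            cum += dist[s]
        model.update(symbol)                 # adapt after coding the symbol
    return low, high

low, high = encode("abracadabra", "abcdr")
print(f"ideal code length: {-math.log2(high - low):.1f} bits")  # equals the total log-loss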
We demonstrate the theoretical performance of a lossless compressor based on the SM model on a number of
standard compression corpora. Table 2 summarizes a comparison of our lossless compressor against other state-of-the-art compressors on the Calgary corpus, a well-known
compression benchmark consisting of 14 files of different
types and varying lengths.
In addition to the experiments on the Calgary corpus, SM
compression performance was also evaluated on a 100MB
excerpt of the English version of Wikipedia (XML text
dump).12 On this excerpt, the SM model achieved a log-loss of 1.66 bits/symbol, amounting to a compressed file size of 20.80MB. While this is worse than the 16.23MB achieved by the
best demonstrated Wikipedia compressor, it shows
that the SM model can scale to sequences of this length. We
have also explored the performance of the SM model when
using a larger symbol set (Σ). In particular, we used the SM
model to compress UTF-16 encoded Chinese text using a 16-bit alphabet. On a representative text file, the Chinese Union Version of the Bible, we achieved a log-loss of 4.91 bits per Chinese character, which is significantly better than the best result in the literature (5.44 bits).22
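The compressed sizes quoted above follow directly from the per-symbol log-loss. For example, treating the 100MB excerpt as one symbol per byte and ignoring any container overhead:

symbols = 100_000_000            # 100MB excerpt, assumed one symbol per byte
bits_per_symbol = 1.66           # SM log-loss reported above (rounded)
size_mb = symbols * bits_per_symbol / 8 / 1_000_000
print(f"{size_mb:.2f} MB")       # about 20.75MB; the reported 20.80MB reflects the unrounded log-loss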
7. Conclusion
The SM achieves improved compression and language modeling performance. These application-specific performance
improvements are arguably worthwhile scientific achievements by themselves. Both have the potential to be tremendously useful, and may yield practical consequences of
societal and commercial value.
We encourage the reader, however, not to mentally categorize the SM as a compressor or language model. Nature is
replete with discrete sequence data that exhibits long-range
dependencies and power-law characteristics. The need to model the processes that generate such data is only likely to grow. The SM is a general-purpose model for discrete sequence data that remains computationally tractable despite its power and despite the fact that it makes only very general assumptions about the data-generating process.
Our aim in communicating the SM is also to encourage
readers to explore the fields of probabilistic and Bayesian
modeling in greater detail. Expanding computational capacity, along with significant increases in the amount and variety of data to be analyzed across many scientific and engineering disciplines, is rapidly enlarging the class of probabilistic models that one can imagine and employ. Over the coming decades, hierarchical Bayesian models are