nth order Markov model for x: It says that the distribution over each symbol x_i in x, given that its context, consisting of the previous n symbols x_{i-n:i-1}, is u, is simply G_u.
The hierarchical Bayesian model in Equation 4 is called the hierarchical Pitman–Yor process.18 It formally encodes our context tree similarity assumption about the conditional distributions through dependence among them induced by the hierarchy, with more similar distributions being more dependent. It is this dependence that allows the model to share information across the different contexts, and thereby improve the estimation of all the conditional distributions. It is worth noting that there is a well-known connection between the hierarchical PYP and a type of smoothing for m-gram language models called interpolated Kneser–Ney.10, 18
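To make the recursive sharing concrete, here is a small illustrative sketch in Python (not the inference machinery used by the SM): an nth-order context tree in which each context's predictive distribution discounts its own counts and hands the leftover mass to its parent, the context with the earliest symbol dropped. It uses the common simplification of one "table" per observed symbol type and zero concentration, under which the hierarchical PYP predictive rule reduces to interpolated Kneser–Ney smoothing; the class and parameter names are invented for the example.

```python
from collections import defaultdict


class PYContextTree:
    """Illustrative nth-order context tree with Pitman-Yor-style smoothing.

    Simplification: one 'table' per observed symbol type and zero
    concentration, so the predictive rule coincides with interpolated
    Kneser-Ney rather than the full Chinese-restaurant-franchise
    representation used for exact inference.
    """

    def __init__(self, alphabet, order, discount=0.5):
        self.alphabet = list(alphabet)
        self.order = order
        self.discount = discount
        # counts[u][s] = number of times symbol s followed context u
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe(self, x):
        """Record each symbol of x under its (up to) length-n context."""
        for i, s in enumerate(x):
            u = tuple(x[max(0, i - self.order):i])
            self.counts[u][s] += 1

    def predict(self, context, s):
        """P(s | context), truncating the context to the last n symbols."""
        u = tuple(context[-self.order:]) if self.order > 0 else ()
        return self._predict(u, s)

    def _predict(self, u, s):
        # Parent distribution: the uniform base H at the root, otherwise
        # the predictive distribution of sigma(u), the context with its
        # earliest symbol dropped.  This is where information is shared.
        if len(u) == 0:
            parent = 1.0 / len(self.alphabet)
        else:
            parent = self._predict(u[1:], s)
        c = self.counts.get(u)
        if not c:                      # context never observed: pure back-off
            return parent
        total = sum(c.values())
        types = len(c)                 # distinct symbols seen after u
        d = self.discount
        # Discounted count for s, plus the mass handed back to the parent.
        return (max(c.get(s, 0) - d, 0.0) + d * types * parent) / total


# Example: a 2nd-order model over the binary alphabet.
model = PYContextTree(alphabet="01", order=2)
model.observe("0110100110")
print(model.predict("01", "1"))   # probability of a 1 following context 01
```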
5. SEQUENCE MEMOIZER
The name sequence memoizer (SM) refers both to an extension of the hierarchical PYP model presented in the previous section and to the set of techniques required to make practical use of the extended model. We first describe how the SM model extends the hierarchical PYP model and then discuss how to reduce the complexity of the model to make it computationally tractable. Finally, we sketch how inference is done in the SM.
5.1. The model
The SM model is a notationally subtle but important extension of the hierarchical PYP model (Equation 4) described in the previous section. Instead of limiting the context lengths to n, the model is extended to include the set of distributions in all contexts of any (finite) length. This means that the distribution over each symbol is now conditioned on all previous symbols, not just the previous n.
Formally, the SM model is defined exactly as the hierarchical PYP model in Equation 4, but with two differences. First, the contexts range over all finite nonempty strings, u ∈ Σ*\ε. Second, in the third line of Equation 4, instead of conditioning only on the previous n symbols, we condition on all previous symbols, so that x_i | x_{1:i-1} = u, G_u ~ G_u. The assumptions embodied in the resulting model remain the same as those for the hierarchical PYP model: power-law scaling and similarity between related contexts.
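Spelled out, and writing PY(d, c, G) for a Pitman–Yor process with discount d, concentration c, and base distribution G (this notation and the per-level parameters d_{|u|}, c_{|u|} are assumed here, since Equation 4 itself is not reproduced in this excerpt), the extended model reads:

```latex
\begin{align*}
G_{\varepsilon} &\sim \mathrm{PY}(d_0, c_0, H), \\
G_{u} \mid G_{\sigma(u)} &\sim \mathrm{PY}\bigl(d_{|u|}, c_{|u|}, G_{\sigma(u)}\bigr)
    &&\text{for all } u \in \Sigma^{*} \setminus \{\varepsilon\}, \\
x_i \mid x_{1:i-1} = u,\ G_{u} &\sim G_{u}
    &&\text{for } i = 1, \dots, T,
\end{align*}
```

where σ(u) denotes the context u with its earliest symbol removed and H is the base distribution over the alphabet Σ.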
The SM model can be interpreted as the limit of a
hierarchical PYP model as the Markov order n tends to
infinity. One’s first impression of such a model might be
that it would be impossible to handle, both statistically
because of overfitting and other problems, and computationally because the model as described so far cannot
even be represented in a computer with finite memory!
Fortunately, the Bayesian approach, where we compute the posterior distribution and marginalize over the parameters as in Equation 2 to obtain estimators of interest, prevents overfitting. Additionally, the techniques we
develop in the next subsection make computation in this
model practical.
5.2. Compacting the context tree
While lifting the finite restriction on context lengths seems very desirable from a modeling perspective, the resulting SM model is a prior over an infinite number of parameters (conditional distributions). In order to compute in this model, the number of conditional distributions that are accessed must be reduced to a finite number. The key observation that makes this possible is that, given a finite-length sequence of symbols x, we only need access to a finite number of conditional distributions. In particular, we only need G_{x_{1:i}} where i = 0, ..., T, along with all the ancestors of each G_{x_{1:i}} in the context tree. The ancestors are needed because each G_u has a prior that depends on its parent G_{σ(u)}. The resulting set of conditional distributions that the sequence x actually depends on consists of G_u where u ranges over all contiguous substrings of x, a finite set of O(T^2) contexts. All other contexts in the tree can effectively be ignored. We denote this subtree of the context tree that x actually depends on by T(x); Figure 2b shows an example with x = 0110.
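One way to see why only O(T^2) contexts are needed is to enumerate them directly: take each prefix x_{1:i} and close the set under the parent map σ (repeatedly dropping the earliest symbol); the closure is exactly the set of contiguous substrings of x. A minimal sketch in Python (the function name is invented for illustration):

```python
def contexts_needed(x):
    """Enumerate the contexts T(x) that a sequence x actually depends on.

    We need G_{x_{1:i}} for i = 0, ..., T, plus every ancestor obtained by
    repeatedly dropping the earliest symbol (the parent map sigma).  The
    closure is exactly the set of contiguous substrings of x (including
    the empty context), of size O(T^2).
    """
    contexts = set()
    for i in range(len(x) + 1):        # each prefix x_{1:i} ...
        u = x[:i]
        while True:                    # ... and all of its sigma-ancestors
            contexts.add(u)
            if not u:
                break
            u = u[1:]                  # sigma(u): drop the earliest symbol
    return contexts


# For x = 0110 (the example referenced above), the needed contexts are
# exactly the contiguous substrings of "0110":
print(sorted(contexts_needed("0110"), key=lambda u: (len(u), u)))
# ['', '0', '1', '01', '10', '11', '011', '110', '0110']
```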