from zero, mitigating underestimation of their probability.
We describe this effect as “stealing from the rich and giving to the poor.” This is precisely how the PYP manifests a
power-law characteristic. If one thinks of the M(s)'s and α as parameters, then one could imagine ways to set them to best describe the data. Intuitively this is not far from what is actually done, except that the M(s)'s and α are themselves treated in a Bayesian way, i.e., we average over them under the posterior
distribution in Equation 3.
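To see the reallocation concretely, the following is a minimal sketch of the discounted predictive rule that a single PYP yields when the observation counts (written N(s) here) and the M(s)'s are taken as given, together with a discount α and base distribution G0. All names and numbers below are illustrative; in the model the M(s)'s are averaged over under the posterior rather than fixed.

    # Minimal sketch (illustrative names): the discounted predictive rule of a
    # single Pitman-Yor process, with counts N(s), multiplicities M(s), discount
    # alpha and base distribution G0 all taken as given.  Assumes N(s) >= M(s).
    def pyp_predictive(N, M, alpha, G0):
        """Return P(next symbol = s) for each symbol s that G0 puts mass on."""
        total_N = sum(N.values())                         # total observations
        total_M = sum(M.values())                         # total of the M(s)'s
        return {s: (N.get(s, 0) - alpha * M.get(s, 0)     # take from the rich ...
                    + alpha * total_M * g0) / total_N     # ... give back via G0
                for s, g0 in G0.items()}

    # Frequent symbols lose alpha*M(s) of their empirical mass; every symbol,
    # including unseen ones, gets some of it back in proportion to G0(s).
    N  = {"the": 50, "of": 30, "memoizer": 1}
    M  = {"the": 3,  "of": 2,  "memoizer": 1}
    G0 = {"the": 0.25, "of": 0.25, "memoizer": 0.25, "unseen": 0.25}
    print(pyp_predictive(N, M, alpha=0.8, G0=G0))

In this toy example the unseen symbol receives probability α·M·G0(s)/N rather than zero, which is exactly the pull away from zero described above.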
4.2. Context trees
We now return to making use of the contextual dependencies in x and to estimating all of the conditional distributions Gu relevant to predicting symbols following a general
context u. The assumption we make is that if two contexts
are similar, then the corresponding conditional distributions over the symbols that follow those contexts will tend
to be similar as well. A simple and natural way of defining
similarity between contexts is that of overlapping contextual suffixes. This is easy to see in a concrete example from
language modeling. Consider the distribution over words that would follow u = "in the United States of". The assumption we make is that this distribution will be similar to the distribution following the shorter context "the United States of", which we in turn expect to be similar to the distribution following "United States of". These contexts all share the same length-three suffix.
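A throwaway check of that claim, assuming word-level symbols (purely illustrative):

    # The three example contexts end in the same three words, which is the
    # suffix-overlap notion of similarity used to tie their predictive
    # distributions together (word-level symbols assumed).
    contexts = ["in the United States of", "the United States of", "United States of"]
    suffixes = {tuple(c.split()[-3:]) for c in contexts}
    print(suffixes)        # {('United', 'States', 'of')}: a single shared suffix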
In this section and the following one, we will discuss
how this assumption can be codified using a hierarchical
Bayesian model.8,11 To start, we will only consider fixed,
finite length contexts. When we do this we say that we are
making an nth order Markov assumption. This means that
each symbol only depends on the last n observed symbols.
Note that this assumption dictates that the distributions are not merely similar but equal for any contexts that agree in their last n symbols. This equality constraint is
a strong assumption that we will relax in Section 5.
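As a small sketch of what the Markov assumption means in practice (illustrative code, not the article's), observations can simply be pooled by their preceding n symbols, one empirical distribution per context:

    # Under an nth order Markov assumption each symbol depends only on the
    # preceding n symbols, so observations can be pooled by that length-n
    # (or shorter, near the start of the string) context.
    from collections import Counter, defaultdict

    def contextual_counts(x, n):
        counts = defaultdict(Counter)
        for i, symbol in enumerate(x):
            context = tuple(x[max(0, i - n):i])   # the last n observed symbols
            counts[context][symbol] += 1
        return counts

    # For the string 0110 and n = 3 this yields exactly the gray observations
    # of Figure 2(b): the empty context sees 0, context 0 sees 1,
    # context 01 sees 1, and context 011 sees 0.
    for u, c in contextual_counts("0110", n=3).items():
        print("".join(u) or "(empty)", dict(c))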
We can visualize the similarity assumption we make by constructing a context tree: arrange the contexts u (and the associated distributions Gu) in a tree where the parent of a node u, denoted σ(u), is given by its longest proper suffix (i.e., u with its first symbol from the left removed). Figure 2 gives an example of a context tree with n = 3 and Σ = {0, 1}. Since for now we are making an nth order Markov assumption, it is sufficient to consider only the contexts u ∈ Σ^{≤n} = {u′ ∈ Σ*: |u′| ≤ n} of length at most n. The resulting context tree has height n and the total number of nodes in the tree grows exponentially in n.
The memory complexity of models built on such context trees
usually grows too large and too quickly for reasonable values
of n and |Σ|. This makes it nearly impossible to estimate all of
the distributions Gu in the naïve way described in Section 2.
This estimation problem led us to hierarchical Bayesian modeling using Pitman–Yor processes.
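The following short sketch (illustrative names only) spells out the tree just described: every context of length at most n is a node, its parent σ(u) is its longest proper suffix, and even for Σ = {0, 1} and n = 3 the full tree already has 15 nodes:

    # The fixed-depth context tree: nodes are all contexts of length <= n, and
    # the parent of u is sigma(u), its longest proper suffix (u with its
    # leftmost, i.e. oldest, symbol removed).  Illustrative sketch only.
    from itertools import product

    def sigma(u):
        """Parent of context u in the context tree."""
        return u[1:]

    def full_context_tree(symbols, n):
        """Map every non-empty context of length <= n to its parent context."""
        contexts = [tuple(c) for k in range(1, n + 1)
                    for c in product(symbols, repeat=k)]
        return {u: sigma(u) for u in contexts}    # the empty context is the root

    tree = full_context_tree(("0", "1"), n=3)
    print(len(tree) + 1)            # 15 nodes: (|Sigma|**(n+1) - 1) / (|Sigma| - 1)
    print(tree[("0", "1", "1")])    # parent of context 011 is 11, as in Figure 2(a)

The node count grows geometrically in n, which is the memory problem noted above.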
4.3. Hierarchical Pitman–Yor processes
Having defined a context tree and shown that the Pitman–Yor prior over distributions exhibits power-law characteristics, it remains to integrate the two.
Recall that G ∼ PY(α, G0) means that G is a random distribution with a PYP prior parameterized by a discount parameter α and a base distribution G0. The expected value of G
under repeated draws from the PYP is the base distribution G0.
Because of this fact, we can use this process to encode the assumption that on average G should be similar
to G0. To be clear, this is just a prior assumption. As always,
observing data may lead to a change in our belief. We can
use this mechanism to formalize the context tree notion of
similarity. In particular, to encode the belief that Gu should
be similar to Gσ(u), we can use a PYP prior for Gu with base distribution Gσ(u). We can apply the same mechanism at each
node of the context tree, leading to the following model
specification:
Gε ∼ PY(α0, G0)
Gu | Gσ(u) ∼ PY(α|u|, Gσ(u))      for all u ∈ Σ^{≤n} \ {ε}
xi | xi−n:i−1 = u, Gu ∼ Gu        for i = 1, . . . , T        (4)
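Read operationally, the chain of base distributions in Equation 4 makes prediction in a context fall back recursively on its parent and ultimately on G0. The sketch below assumes the per-context counts Nu(s) and multiplicities Mu(s) are simply given (the article infers them by averaging under the posterior), and all helper names are illustrative:

    # Recursive fallback induced by the hierarchy in Equation 4, *assuming*
    # per-context counts N_u(s) and multiplicities M_u(s) are already given
    # (in the article they are inferred; here they are fixed for illustration).
    def predict(u, s, N, M, alpha, G0):
        """P(next symbol = s | context u) under the chained PYP priors."""
        if u is None:                      # above the root: the base distribution G0
            return G0[s]
        parent = u[1:] if u else None      # sigma(u); the empty context falls back to G0
        parent_prob = predict(parent, s, N, M, alpha, G0)
        n_u = sum(N.get(u, {}).values())   # observations made in context u
        if n_u == 0:                       # nothing seen here yet: defer to the parent
            return parent_prob
        m_u = sum(M.get(u, {}).values())   # total multiplicity in context u
        a = alpha[len(u)]                  # depth-dependent discount alpha_|u|
        return (N[u].get(s, 0) - a * M[u].get(s, 0)    # discounted local evidence
                + a * m_u * parent_prob) / n_u         # plus mass passed to the parent

    # With no observations anywhere, the recursion reduces to G0:
    alphas = [0.5, 0.62, 0.69, 0.74]       # illustrative discounts alpha_0 .. alpha_3
    print(predict("011", "1", {}, {}, alphas, {"0": 0.5, "1": 0.5}))   # -> 0.5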
The second line says that a priori the conditional distribution Gu should be similar to Gσ(u), its parent in the context tree. The variation of Gu around its mean Gσ(u) is described by a PYP with a context-length-dependent discount parameter α|u|. At the top of the tree, the distribution Gε for the empty context ε is similar to an overall base distribution
G0, which specifies our prior belief that each symbol s will
appear with probability G0(s). The third line describes the
Figure 2. (a) Full context tree containing all contexts up to length 3 over the symbol set Σ = {0, 1}. (b) Context tree actually needed for the string 0110. Observations are denoted in gray below the context in which they were observed. (c) Compact context tree for the same string, with non-branching chains marginalized out.