the posterior distribution P(Θ|x) = P(Θ)P(x|Θ)/P(x), which
specifies the belief about the parameter Θ after combining both sources of information. Computations such as
prediction are then done taking into account the a posteriori
uncertainty about the underlying parameters.
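As a minimal illustration of this computation, the following Python sketch evaluates a posterior over a grid of candidate parameter values for a simple coin-flip model and uses it for prediction; the uniform prior and the counts are made up for illustration.

# Bayes' rule on a discrete grid: P(theta|x) is proportional to P(theta)P(x|theta).
import numpy as np

thetas = np.linspace(0.01, 0.99, 99)        # candidate parameter values
prior = np.ones_like(thetas) / len(thetas)  # uniform prior P(theta)
heads, tails = 7, 3                         # observed data x
likelihood = thetas**heads * (1 - thetas)**tails
posterior = prior * likelihood
posterior /= posterior.sum()                # normalizing plays the role of P(x)

# Prediction averages over the a posteriori uncertainty about theta.
p_next_head = np.sum(posterior * thetas)
print(p_next_head)  # roughly (heads + 1) / (heads + tails + 2) for this prior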
What kinds of prior knowledge about natural sequence data
might we wish to employ? We make use of two: that natural
sequence data often exhibits power-law properties, and that
conditional distributions of similar contexts tend to be similar
themselves, particularly in the sense that recency matters. We
will consider each of these in turn in the rest of this section.
4.1. Power-law scaling
As with many other natural phenomena like social networks
and earthquakes, occurrences of words in a language follow a power-law scaling.23 This means that there are a small
number of words that occur disproportionately frequently
(e.g., the, to, of), and a very large number of rare words that,
although each occurs rarely, taken together make up a
large proportion of the language. The power-law scaling in
written English is illustrated in Figure 1. In this subsection
we will describe how to incorporate prior knowledge about
power-law scaling in the true generative process into our
Bayesian approach. To keep the exposition simple, we will
start by ignoring contextual dependencies and instead focus
only on one way of estimating probability distributions that
exhibit power-law scaling.
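As an aside, the rank/frequency statistics behind a plot like Figure 1 are straightforward to compute. The short Python sketch below does so for an arbitrary text file; the file name corpus.txt is a placeholder, and the printed quantities are only meant to show the heavy-tailed shape.

# Sketch of the rank/frequency computation behind a plot like Figure 1.
# "corpus.txt" is a placeholder for any large body of English text.
from collections import Counter

with open("corpus.txt") as f:
    words = f.read().lower().split()

counts = Counter(words)
freqs = sorted(counts.values(), reverse=True)  # frequency ordered by rank
total = sum(freqs)

# A power law means log-frequency falls roughly linearly with log-rank.
for rank in (1, 10, 100, 1000):
    if rank <= len(freqs):
        print(rank, freqs[rank - 1] / total)

# Share of the text covered by the 200 most frequent words.
print(sum(freqs[:200]) / total)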
Figure 1. Illustration of the power-law scaling of word frequencies in English text. The relative word frequency (estimated from a large corpus of written text) is plotted against each word's rank when ordered according to frequency. One can see that there are a few very common words and a large number of relatively rare words; in fact, the 200 most common words account for over 50% of the observed text. The rank/frequency relationship is very close to a pure power-law relationship, which would be a perfectly straight line on this log-log plot. Also plotted are samples drawn from a PYP (in blue) and a Dirichlet distribution (in red) fitted to the data. The Pitman-Yor captures the power-law statistics of the English text much better than the Dirichlet.

To see why it is important to incorporate knowledge about power-law scaling, consider again the ML estimate given by the relative frequency of symbol occurrences. For the frequently occurring symbols, their corresponding probabilities will be well estimated since they are based on many observations of the symbols. On the other hand, our estimates of the rare symbol probabilities will not be good at all. In particular, if a rare symbol did not occur in our sequence (which is likely), our estimate of its probability will be zero, while the probability of a rare symbol that did occur just by chance in our sequence will be overestimated. Since most symbols in S will occur quite rarely under a power law, our estimates of G(s) will often be inaccurate.
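The following small Python sketch makes this failure mode concrete: the relative-frequency (ML) estimate assigns probability exactly zero to any symbol that happens not to appear in the sequence. The toy sequence and alphabet are, of course, made up.

# The ML (relative-frequency) estimate G(s) = N(s)/N assigns probability zero
# to any symbol not seen in x, and rests on very little evidence for rare ones.
from collections import Counter

x = list("the cat sat on the mat")      # a toy symbol sequence
alphabet = set("abcdefghijklmnopqrstuvwxyz ")
N = len(x)
counts = Counter(x)

G_ml = {s: counts[s] / N for s in alphabet}
print(G_ml["t"])   # a frequent symbol: reasonably well estimated
print(G_ml["z"])   # an unseen symbol: estimated as exactly zero
print(G_ml["o"])   # a symbol seen once: its estimate rests on a single chance observation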
To encode our prior knowledge about power-law scaling, we
use a prior distribution called the Pitman-Yor process (PYP),15
which is a distribution over the discrete probability distribution G. It has three parameters: a base distribution
G0, which is the mean of the PYP and reflects our
prior belief about the frequencies of each symbol, a discount
parameter a between 0 and 1 which governs the exponent of
the power-law, and a concentration parameter c which governs
the variability around the mean G0. When a = 0, the PYP loses
its power-law properties and reduces to the more well-known
Dirichlet process. In this paper, we assume c = 0 instead for
simplicity; see Gasthaus and Teh6 for the more general case
when c is allowed to be positive. When we write G ~ PY(a, G0),
it means that G has a prior given by a PYP with the given parameters. Figure 1 illustrates the power-law scaling produced by
PY processes.
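One way to see where such power-law samples come from is to simulate the Chinese-restaurant-style predictive rule associated with a PYP when c = 0. The Python sketch below is only meant to show the heavy-tailed cluster sizes this rule produces; the settings are illustrative, and it is not the inference machinery used later in the paper.

# Two-parameter "Chinese restaurant" simulation of a PYP with c = 0:
# token n+1 starts a new cluster with probability a*K/n (K = current number
# of clusters) and joins an existing cluster k with probability (n_k - a)/n.
# The resulting cluster sizes have the heavy-tailed character of Figure 1.
import random

def pyp_cluster_sizes(n_tokens, a, seed=0):
    rng = random.Random(seed)
    sizes = [1]                          # the first token starts the first cluster
    for n in range(1, n_tokens):
        K = len(sizes)
        r = rng.random() * n             # total unnormalized mass is n
        if r < a * K:
            sizes.append(1)              # open a new cluster
        else:
            r -= a * K
            for k in range(K):
                r -= sizes[k] - a        # existing cluster k has mass n_k - a
                if r < 0:
                    sizes[k] += 1
                    break
            else:
                sizes[-1] += 1           # guard against floating-point rounding
    return sorted(sizes, reverse=True)

sizes = pyp_cluster_sizes(20000, a=0.8)
print(len(sizes))     # many clusters in total...
print(sizes[:10])     # ...but a handful of very large ones dominate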
To convey more intuition about the PYP we can consider
how using it affects our estimate of symbol frequencies.
Note that in our Bayesian framework G is random, and one
of the standard steps in a procedure called inference is to
estimate a posterior distribution P(G|x) from data. The probability that symbol s ∈ S occurs next is then:

P(s|x) = E[G(s)] = ∫ G(s) P(G|x) dG,

where E in this case stands for expectation with respect
to the posterior distribution P(G|x). This integral is a standard Bayesian computation that sometimes has an analytic
solution but often does not. When it does not, as in this
situation, it is often necessary to turn to numerical integration approaches, including sampling and Monte Carlo
integration.16
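As a rough illustration of what such a Monte Carlo computation looks like, the sketch below estimates E[G(s)] by averaging samples from a posterior that happens to be easy to sample from: a Dirichlet posterior over a four-symbol alphabet stands in for the PYP posterior, and the counts are invented.

# Monte Carlo integration of E[G(s)] = ∫ G(s) P(G|x) dG, using a posterior
# that is easy to sample from (a Dirichlet, standing in for the PYP posterior).
import numpy as np

rng = np.random.default_rng(0)
counts = np.array([5, 2, 0, 1])                  # N(s) for a 4-symbol alphabet
alpha = np.ones_like(counts) + counts            # Dirichlet(1,...,1) prior plus data

samples = rng.dirichlet(alpha, size=100_000)     # draws G ~ P(G|x)
mc_estimate = samples.mean(axis=0)               # average of G(s) over the samples
exact = alpha / alpha.sum()                      # analytic answer, for comparison
print(mc_estimate)
print(exact)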
For the PYP, the expectation E[G(s)] can be expressed in terms of the observed counts together with a set of latent counts introduced during inference. Writing N(s) for the number of times symbol s occurs in x, N for the total number of observations, M(s) ≤ N(s) for the latent count attached to s, and M for the sum of the M(s)'s, the predictive probability takes the form

P(s|x) = E[(N(s) − aM(s) + aMG0(s))/N],    (3)

where the expectation is now taken over the posterior distribution of the latent counts M(s).
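To make the roles of these counts concrete, here is a small Python sketch of the computation inside Equation 3. The alphabet size, the counts N(s), the latent counts M(s), and the uniform base distribution are all invented for illustration, and the expectation over the M(s)'s is ignored by simply plugging in fixed values.

# The discount-and-redistribute computation behind the predictive rule above,
# with c = 0: each count N(s) loses a*M(s), and the total removed, a*M, is
# given back to every symbol in proportion to the base distribution G0(s).
import numpy as np

a = 0.8                                  # discount parameter
V = 1000                                 # alphabet size
G0 = np.full(V, 1.0 / V)                 # uniform base distribution
N_s = np.zeros(V)
M_s = np.zeros(V)
N_s[:3] = [50, 10, 1]                    # three observed symbols
M_s[:3] = [3, 2, 1]                      # latent counts, with M(s) <= N(s)

N = N_s.sum()
M = M_s.sum()
pred = (N_s - a * M_s + a * M * G0) / N  # discounted, redistributed estimate
ml = N_s / N                             # plain relative-frequency estimate

print(ml[:4])      # the unseen fourth symbol gets probability exactly zero
print(pred[:4])    # its estimate is now positive; observed counts are shrunk
print(pred.sum())  # the redistributed estimate still sums to one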
Given this, it is natural to ask what purpose these M(s)’s
serve. By studying Equation 3, it can be seen that each M(s)
reduces the count N(s) by aM(s) and that the total amount
subtracted is then redistributed across all symbols in S
proportionally according to the symbols’ probability under
the base distribution G0. Thus non-zero counts are usually
reduced, with larger counts typically reduced by a larger
amount. Doing this mitigates the overestimation of probabilities of rare symbols that happen to appear by chance.
On the other hand, for symbols that did not appear at all,
the estimates of their probabilities are pulled upward