from S and would like to estimate the probability that the
next symbol takes on a particular value.
One way to estimate the probability that the next symbol takes some value s ∈ S is to use the relative frequency
of its occurrence in x, i.e., if s occurs frequently in x we
expect its probability of appearing next to be high as well.
Assuming that x is long enough, doing this will be better than giving equal probability to all symbols in S. Let us
denote by N(s) the number of occurrences of s in x. Our estimate of the probability of s being the next symbol is then G(s) = N(s)/Σ_{s′∈S} N(s′), where the denominator is simply the length of x. The function G is a discrete
distribution over the elements of S: it assigns a non-negative
number G(s) to each symbol s signifying the probability of
observing s, with the numbers summing to one over S.
Of course, this approach is only reasonable if the process
generating x has no history dependence (e.g., if x is produced by a sequence of tosses of a biased coin). It is highly
unsatisfying if there are contextual dependencies which
we can exploit. If we start accounting for context, we can
quickly improve the quality of the predictions we make. For
instance, why not take into account the preceding symbol?
Let u be another symbol. If the last symbol in x is u, then we
can estimate the probability of the next symbol being s by
counting the number of times s occurs after u in x. As before,
we can be more precise and define

Gu(s) = N(us)/N(u)    (1)

to be the estimated probability of s occurring after u, where
N(us) is the number of occurrences of the subsequence us in x.
The function Gu is again a discrete distribution over the symbols in S, but it is now a conditional distribution as the probability assigned to each symbol s depends on the context u.
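A minimal sketch of these conditional (bigram) counts, with names of our own choosing:

```python
from collections import Counter, defaultdict

def conditional_frequencies(x):
    """Estimate Gu(s) = N(us) / N(u) from adjacent symbol pairs in x."""
    pair_counts = defaultdict(Counter)
    for u, s in zip(x, x[1:]):   # every occurrence of the pair us
        pair_counts[u][s] += 1
    return {u: {s: n / sum(c.values()) for s, n in c.items()}
            for u, c in pair_counts.items()}

Gu = conditional_frequencies("abab")
# every 'a' was followed by 'b', so Gu['a']['b'] == 1.0
```

Note that each Gu is itself a distribution: for every context u, the probabilities Gu(s) sum to one.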
3. MAXIMUM LIKELIHOOD
Some readers may realize that the counting procedure
described above corresponds to a ubiquitous statistical estimation technique called maximum likelihood
(ML) estimation. The general ML estimation setup is
as follows: we observe some data x which is assumed to
have been generated by some underlying stochastic process and wish to estimate parameters Θ for a probabilistic model of this process. A probabilistic model defines
a distribution P(x|Θ) over x parameterized by Θ, and the
ML estimator is the value of Θ maximizing P(x|Θ). In our
a It is straightforward to consider multiple sequences in our setting; we
consider being given only one sequence in this paper for simplicity.
92 Communications of the ACM | FEBRUARY 2011 | VOL. 54 | NO. 2
case, the data consists of the observed sequence, and the
parameters are the conditional distributions Gu for some
set of u’s.
In a sense, ML is an optimistic procedure, in that it
assumes that x is an accurate reflection of the true underlying process that generated it, so that the ML parameters will be an accurate estimate of the true parameters.
It is this very optimism that is its Achilles heel, since
it becomes overly confident about its estimates. This
situation is often referred to as overfitting. To elaborate
on this point, consider the situation in which we have
long contexts. The denominator of (1) counts the number
of times that the context u occurs in x. Since x is of finite
length, when u is reasonably long, the chance that u never
occurs at all in x can be quite high, so (1) becomes undefined with a zero denominator. More pernicious still is if
we are “lucky” and u did occur once or a few times in x. In
this case (1) will assign high probability to the few symbols
that just by chance did follow u in x, and zero probability
to other symbols. Does it mean that these are the only symbols we expect to see in the future following u, or does it
mean that the amount of data we have in x is insufficient to
characterize the conditional distribution Gu? Given a complex process with many parameters, the latter is often the
case, leading to ML estimates that sharpen far too much
around the exact observations and don’t reflect our true
uncertainty.
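Both failure modes are easy to reproduce. In the sketch below (function and variable names are ours), a context that never occurs leaves (1) undefined, while a long context that happens to occur once yields a degenerate, overconfident distribution:

```python
from collections import Counter

def ml_context_estimate(x, u):
    """ML estimate of the next-symbol distribution after context u.
    Returns None when u is never followed by a symbol in x
    (the zero-denominator case)."""
    counts = Counter()
    k = len(u)
    for i in range(len(x) - k):
        if x[i:i + k] == u:          # every occurrence of u followed by a symbol
            counts[x[i + k]] += 1
    total = sum(counts.values())
    if total == 0:
        return None                  # (1) is undefined for this context
    return {s: n / total for s, n in counts.items()}

x = "the cat sat on the mat"
short = ml_context_estimate(x, "the ")         # {'c': 0.5, 'm': 0.5}
lucky = ml_context_estimate(x, "sat on the ")  # {'m': 1.0}: all mass on one symbol
never = ml_context_estimate(x, "sat on a ")    # None: zero denominator
```

The "lucky" long context assigns probability one to 'm' and zero to everything else, even though the single observation hardly justifies such certainty.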
Obviously, if one uses models that consider only short
context lengths, this problem can largely be avoided, provided one
has enough data to estimate the (relatively) small number of conditional distributions involved. This is precisely what is typically done: one makes a fixed-order Markov assumption and
restricts oneself to estimating collections of distributions
conditioned on short contexts (for instance, an nth-order
Markov model, or an m-gram language model). The consequence of doing this is that ML estimation becomes
feasible, but longer-range dependencies are discarded. By
assumption and design, they cannot be accounted for by
such restrictive models.
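Such a fixed-order model can be sketched as follows: an nth-order Markov model conditions each symbol on only its n immediate predecessors (names are ours):

```python
from collections import Counter, defaultdict

def fit_markov(x, n):
    """ML fit of an nth-order Markov model: condition each symbol
    only on the n symbols immediately preceding it."""
    counts = defaultdict(Counter)
    for i in range(n, len(x)):
        counts[x[i - n:i]][x[i]] += 1   # context of length exactly n
    return {u: {s: c / sum(ctr.values()) for s, c in ctr.items()}
            for u, ctr in counts.items()}

model = fit_markov("abcabcabd", n=2)
# the context "ab" was followed by 'c' twice and 'd' once
```

Because only length-n contexts are ever stored, any dependence on symbols further back is discarded by construction.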
Even having imposed such a restriction, overfitting often
remains an issue. This has led to the development of creative
approaches to its avoidance. The language modeling and
text compression communities have generally called these
smoothing or back-off methodologies (see Chen and Goodman3
and references therein). In the following, we will propose a
Bayesian approach that retains uncertainty in parameter estimation and thus avoids overconfident estimates.
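Chen and Goodman survey many such methods; one of the simplest is additive (add-α) smoothing, which gives every symbol a pseudo-count so that no probability is ever exactly zero. A sketch under our own naming:

```python
from collections import Counter

def additive_smoothing(x, alphabet, alpha=1.0):
    """Add-alpha smoothing: each symbol in the alphabet receives
    pseudo-count alpha on top of its empirical count."""
    counts = Counter(x)
    total = len(x) + alpha * len(alphabet)
    return {s: (counts[s] + alpha) / total for s in alphabet}

Gs = additive_smoothing("aab", alphabet="abc")
# 'c' never occurs, yet receives (0 + 1) / (3 + 3) = 1/6
```

The smoothed estimates still sum to one, but unseen symbols retain a small share of the probability mass rather than being ruled out entirely.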
4. BAYESIAN MODELING
As opposed to ML, the Bayesian approach is inherently
conservative. Rather than trusting the data fully, Bayesian
parameter estimation incorporates both evidence from the
data as well as from prior knowledge of the underlying process. Furthermore, uncertainty in estimation is taken into
account by treating the parameters Θ as random, endowed
with a prior distribution P(Θ) reflecting the prior knowledge
we have about the true data generating process. The prior
distribution is then combined with the likelihood P(x|Θ) to
yield, via Bayes’ Theorem (the namesake of the approach),