Like true mathematicians, we will silently switch between these two representations of distributions whenever convenient. Unlike mathematicians, however, to keep things simple we will not consider continuous distributions. We want our distribution to hold generically any type A, and most of the types we deal with in code are discrete and not “measurable” or real number-like.
Because the values we care about are usually not even comparable, we will also avoid cumulative distributions. One reason that mathematicians like the standard distributions, such as the Gaussian, beta, binomial, and uniform, is their nice algebraic properties as conjugate priors.2 For example, a uniform prior combined with a binomial likelihood results in a beta posterior. This made 18th- and 19th-century probabilistic computations using pencil and paper feasible, but it is no longer necessary now that powerful computers can run millions of simulations per second.
In programming examples, distributions typically come from the outside as discrete frequentist collections of data with an unknown distribution, or they are defined explicitly as a Bayesian representation by enumerating a list of value/probability pairs. For example, here is the weight distribution of adults in the United States, according to the Centers for Disease Control (CDC):
CDC ∈ ℙ(Weight)
CDC = [obese ↦ 0.4, skinny ↦ 0.6]
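In code, the two representations might look as follows. This is only a sketch under our own naming (IDistribution, Weight are illustrative, not a fixed API): the frequentist view is a stream of samples, and the Bayesian view is an explicit list of value/probability pairs.

using System.Collections.Generic;

// Bayesian representation: the CDC distribution as explicit
// value/probability pairs.
var cdc = new List<(Weight value, double probability)>
{
    (Weight.Obese, 0.4),
    (Weight.Skinny, 0.6),
};

// Frequentist representation: a (potentially infinite) stream of
// samples drawn from a distribution we may not know explicitly.
interface IDistribution<A>
{
    IEnumerable<A> Sample();
}

enum Weight { Obese, Skinny }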
Efficiently sampling from composed distributions is, indeed, rocket science. Just like database query optimizers, advanced sampling methods leverage properties of the leaf distributions and the structure of the query9 or program3 that computes the distribution. They rely on deep and complex mathematical techniques, such as importance sampling, Metropolis-Hastings, Markov chain Monte Carlo, and Gibbs sampling, that are far outside the scope of this article but are important for making real-world computations over probability distributions feasible. As Bayesian analysis consultant John D. Cook remarked, “... Bayesian statistics goes back over 200 years, but it did not take off until the 1980s because that’s when people discovered practical numerical methods for computing with it …”
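Those techniques remain out of scope here, but a toy Metropolis-Hastings chain over our two-element state space (entirely our own sketch, not the article's machinery) shows the key trick: the chain only ever looks at ratios of weights, so it can sample from a distribution that was never normalized.

using System;

// Unnormalized weights: obese = 4, skinny = 6 (only the ratio matters).
double[] weights = { 4, 6 };
var rng = new Random();
int state = 0;
var counts = new int[2];

for (int i = 0; i < 100_000; i++)
{
    int proposal = rng.Next(2);                        // symmetric proposal
    if (rng.NextDouble() < weights[proposal] / weights[state])
        state = proposal;                              // accept, else stay put
    counts[state]++;
}

// Prints roughly 0.40 and 0.60.
Console.WriteLine($"obese ≈ {counts[0] / 100_000.0:F2}, skinny ≈ {counts[1] / 100_000.0:F2}");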
To illustrate the sophistication involved in efficiently sampling known discrete distributions, imagine converting the example distribution CDC from its Bayesian representation into a frequentist one, as sketched in Figure 2. In the other direction, sampling from the collection and counting the frequencies of each element, from a in dist.Take(n) group by a into g select g.Key ↦ g.Count()/n, approximates the Bayesian representation of the distribution. When converting from the Bayesian to the frequentist implementation, the probabilities do not have to add up to 1; the sampling process will ensure the ratios are properly normalized.
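Here is a sketch of that round trip (the Sample helper and its inverse-transform implementation are ours): draw n samples from the Bayesian pairs by walking the cumulative weights, then group and count to recover the distribution approximately. Dividing by the total weight is what normalizes probabilities that do not sum to 1.

using System;
using System.Linq;

var cdc = new[] { ("obese", 0.4), ("skinny", 0.6) };
var rng = new Random();

// Bayesian -> frequentist: inverse transform over the running total.
// Dividing by the total weight normalizes unnormalized inputs.
string Sample((string value, double weight)[] dist)
{
    double u = rng.NextDouble() * dist.Sum(p => p.weight);
    foreach (var (value, weight) in dist)
    {
        u -= weight;
        if (u <= 0) return value;
    }
    return dist[^1].value;  // guard against floating-point rounding
}

// Frequentist -> Bayesian: the group-by/count query from the text.
const int n = 100_000;
var approx = from a in Enumerable.Range(0, n).Select(_ => Sample(cdc))
             group a by a into g
             select (g.Key, (double)g.Count() / n);

foreach (var (key, p) in approx)
    Console.WriteLine($"{key} ↦ {p:F3}");  // ≈ obese ↦ 0.400, skinny ↦ 0.600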
Figure 1. Image recognition results: labels such as person, hair, face, senior citizen, hairstyle, profession, and professional, with confidence scores between 54% and 95%.
Figure 2. Frequentist representation.
inverse
transfer
1
1
1
1
1
1
0
0
0
0
rejection
sampling
x
x
x
x
x
x
0
0
0
0
x
x
x
x
1
1
1
1
1
1
rejection
sampling
x
x
0
0
0
0
1
1
1
1
1
1
alias
method
1
0
0
0
0
1
1
1
1
1
Figure 3. Joint probability distribution.

P(food & weight)   burger              celery              P(weight)
obese              0.4*0.9 = 0.36      0.4*0.1 = 0.04      0.36+0.04 = 0.4
skinny             0.6*0.3 = 0.18      0.6*0.7 = 0.42      0.18+0.42 = 0.6
P(food)            0.36+0.18 = 0.54    0.04+0.42 = 0.46
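The arithmetic in Figure 3 is just prior times likelihood, with the marginals as row and column sums. Here is a small sketch that reproduces the table (the 0.9/0.1 and 0.3/0.7 likelihoods are read off the figure):

using System;

double[] pWeight = { 0.4, 0.6 };              // prior: P(obese), P(skinny)
double[,] pFoodGivenWeight =
{
    { 0.9, 0.1 },                             // P(burger|obese), P(celery|obese)
    { 0.3, 0.7 },                             // P(burger|skinny), P(celery|skinny)
};

var joint = new double[2, 2];
var pFood = new double[2];

for (int w = 0; w < 2; w++)
    for (int f = 0; f < 2; f++)
    {
        joint[w, f] = pWeight[w] * pFoodGivenWeight[w, f];  // e.g. 0.4 * 0.9 = 0.36
        pFood[f] += joint[w, f];                            // column sums
    }

Console.WriteLine($"P(burger) = {pFood[0]:F2}, P(celery) = {pFood[1]:F2}");  // 0.54, 0.46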