into a frequentist representation.8
Perhaps the most obvious method stacks the columns for skinny and obese on top of each other, draws one random number, say p, between 0 and 1, and then checks whether p ≤ 0.4: if so, it yields obese; otherwise it yields skinny. In general, this search is linear in the number of values in the distribution, but using tricks like a binary search tree can speed things up. Mathematicians call this the inverse transform method.
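To make this concrete, here is a minimal Python sketch of the inverse transform method, assuming the CDC distribution of 40% obese and 60% skinny from the example; the names weights and inverse_transform_sample are ours, not the article's code:

import random

# ℙ(Weight) from the example: 40% obese, 60% skinny
weights = [("obese", 0.4), ("skinny", 0.6)]

def inverse_transform_sample(dist):
    # Stack the columns: accumulate probabilities and find where p lands.
    # The linear scan can be replaced by a binary search over the
    # cumulative sums when the distribution has many values.
    p = random.random()              # uniform draw in [0, 1)
    cumulative = 0.0
    for value, prob in dist:
        cumulative += prob
        if p <= cumulative:          # p fell inside this value's column
            return value
    return dist[-1][0]               # guard against floating-point round-off

samples = [inverse_transform_sample(weights) for _ in range(10_000)]
print(samples.count("obese") / len(samples))   # ≈ 0.4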
Another way is first to select a random integer, say weight, to choose between obese and skinny, and then to draw a random double between 0 and 1, say p; if p ≤ CDC[weight], then yield weight, as shown in Figure 2. Mathematicians call this algorithm rejection sampling, and as the histogram shows, half of the attempts to sample a value from the distribution will fail (the pink part). This can be improved by picking a tighter envelope distribution, like that in the second histogram, but even that rejects two out of 12 samples.
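In the same spirit, here is a minimal Python sketch of the uniform-envelope variant of rejection sampling; CDC matches the prior above, while rejection_sample is our own name:

import random

# ℙ(Weight): 40% obese, 60% skinny
CDC = {"obese": 0.4, "skinny": 0.6}

def rejection_sample(dist):
    # Propose a value uniformly (the envelope), then accept it with
    # probability dist[value]; with this envelope the expected acceptance
    # rate is (0.4 + 0.6) / 2 = 0.5, so about half the attempts fail.
    values = list(dist)
    while True:
        weight = random.choice(values)   # uniform proposal
        p = random.random()
        if p <= dist[weight]:            # accept; otherwise try again
            return weight

samples = [rejection_sample(CDC) for _ in range(10_000)]
print(samples.count("obese") / len(samples))   # ≈ 0.4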
The last method pads the lower
probabilities by borrowing from the
higher probabilities. Amazingly, it is
always possible to do this in a way such
that every column represents the probabilities for, at most, two values, so we
need only one comparison to pick the
right value. This comparison can be implemented using a second index table,
and hence mathematicians call this
sampling algorithm the alias method.
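The bookkeeping for this padding is what a construction such as Vose's version of the alias method performs; the following Python sketch (with our own function names) builds the two tables and then samples with a single comparison:

import random

def build_alias_table(dist):
    # Vose's alias method: pad each column so it holds at most two values.
    n = len(dist)
    values = [v for v, _ in dist]
    scaled = [p * n for _, p in dist]          # average column height becomes 1
    prob, alias = [0.0] * n, [0] * n
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l       # top up column s with value l
        scaled[l] -= 1.0 - scaled[s]           # the donor column shrinks
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:                    # leftover columns are exactly full
        prob[i] = 1.0
    return values, prob, alias

def alias_sample(values, prob, alias):
    i = random.randrange(len(values))          # pick a column uniformly
    # one comparison decides between the column's own value and its alias
    return values[i] if random.random() <= prob[i] else values[alias[i]]

table = build_alias_table([("obese", 0.4), ("skinny", 0.6)])
samples = [alias_sample(*table) for _ in range(10_000)]
print(samples.count("obese") / len(samples))   # ≈ 0.4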
Conditional Probability Distributions
Now that we have explained probability distributions ℙ(A), let’s examine
conditional probability distributions
ℙ(B|A), which, according to Wikipedia, are “a measure of the probability
of an event given that (by assumption,
presumption, assertion, or evidence)
another event has occurred.” To a developer's ears, that sounds exactly like a function A→ℙ(B) that returns a distribution, just like a learned model.
The remainder of this article uses the
notations ℙ(B|A) and A→ℙ(B) interchangeably.
Going back to the example, we
have the following model Doctor ∈
ℙ(Food|Weight) of food preference,
given weight, that we could have obtained by asking patients what kind of
food they consume:
Doctor ∈ ℙ(Food|Weight) = Weight → ℙ(Food)
Doctor(obese) = [burger↦0.9, celery↦0.1]
Doctor(skinny) = [burger↦0.3, celery↦0.7]
As argued in the introduction,
these probabilistic functions, such as
ℙ(Object|Image), ℙ(Text|Audio),
ℙ(Spam|Text), and so on, increasingly are the result of training some
ML algorithm or neural net, instead of
being coded by expensive and flaky human developers.
Now that you know that conditional
probabilities are probabilistic functions, things are starting to get interesting, since this means that multiplication (*) used in Bayes’ rule is an
operator that applies a probabilistic
function to a probability distribution
as a parameter—that is, it has the following type signature:
ℙ(B|A)*ℙ(A) ∈ ℙ(A&B)
Using the Bayesian representation of distributions, you can implement the probabilistic function application likelihood*prior, where likelihood ∈ ℙ(B|A) and prior ∈ ℙ(A), using the following correlated query:
likelihood*prior =
  from a↦p in prior
  from b↦q in likelihood(a)
  select (a,b)↦p*q
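Rendered as ordinary Python (with our own names multiply, CDC, and Doctor, and the same lists of value/probability pairs as above), the correlated query becomes a nested comprehension:

# ℙ(Weight): the CDC prior from earlier in the article
CDC = [("obese", 0.4), ("skinny", 0.6)]

# ℙ(Food|Weight): the Doctor model, a function from weight to a distribution
def Doctor(weight):
    return {"obese":  [("burger", 0.9), ("celery", 0.1)],
            "skinny": [("burger", 0.3), ("celery", 0.7)]}[weight]

def multiply(likelihood, prior):
    # likelihood*prior: apply ℙ(B|A) to ℙ(A), yielding the joint ℙ(A&B)
    return [((a, b), p * q)             # select (a,b)↦p*q
            for a, p in prior           # from a↦p in prior
            for b, q in likelihood(a)]  # from b↦q in likelihood(a)

for (weight, food), pq in multiply(Doctor, CDC):
    print(weight, food, round(pq, 2))
# obese burger 0.36, obese celery 0.04, skinny burger 0.18, skinny celery 0.42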
Applying this definition to compute the result of Doctor*CDC, we
obtain the table shown in Figure 3
for the joint probability distribution
ℙ(Food&Weight).
Because the distributions for
ℙ(Weight) and ℙ(Food) appear in
the margins of this table, mathematicians call them marginal probabilities, and similarly the process of
summing up the columns/rows is
called marginalization. When computing a joint distribution using (*),
mathematicians often use the name
likelihood for the function and prior
for the argument.
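As a quick check, summing the two joint entries that contain burger gives the marginal ℙ(burger) = 0.4*0.9 + 0.6*0.3 = 0.54, and likewise ℙ(celery) = 0.04 + 0.42 = 0.46, while summing over food recovers the original CDC marginals of 0.4 for obese and 0.6 for skinny.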
The beauty of the frequentist representation is that there is no need for
multiplying probabilities. Sampling
ensures the underlying ratio of occur-