Implementation concerns. As with
many aspects of security, moving from
theory to practice requires great care.
In particular, a naive implementation
of a privacy-preserving algorithm may
not ensure privacy even though the algorithm is theoretically proven to satisfy the requirements of a chosen privacy
definition. One problem arises from
side-channels. Consider a program
that processes a sensitive database and
behaves as follows: If Bob’s record is in
the database, it produces the output 1
after one day; if Bob’s record is not in
the database, it outputs 1 right away.
The output is the same, no matter what
the database is. But by observing the
time taken for a result to be output, we
learn something about the database. 18
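To illustrate with a hypothetical program (not one drawn from the literature), the following Python sketch returns the same value for every database yet leaks membership through its running time:

import time

def leaky_query(database):
    # The returned value is always 1, so the output alone reveals nothing.
    if "Bob" in database:
        time.sleep(24 * 60 * 60)  # wait one day when Bob's record is present
    return 1

# An observer who times the call learns whether Bob is in the database,
# even though the output itself is identical in both cases.
start = time.time()
print(leaky_query({"Alice", "Carol"}), "after", time.time() - start, "seconds")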
Another concern arises when the
theoretical algorithms base their security properties on exact computation
that may be beyond the limits of digital computers. 5, 31 The most common
example is the addition of noise from
continuous distributions (such as
Gaussian and Laplace). For most floating-point implementations, an analysis of the bit patterns yields additional
information about the input data. 31
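As a minimal sketch of where the problem creeps in (an illustration of the general concern, not the specific analysis in the cited work), consider the textbook way of adding Laplace noise with floating-point arithmetic:

import math
import random

def naive_laplace(true_value, epsilon, sensitivity=1.0):
    # Inverse-CDF sampling of Laplace noise with scale sensitivity/epsilon.
    scale = sensitivity / epsilon
    u = random.random() - 0.5
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_value + noise  # a 64-bit float, not an ideal real number

print(naive_laplace(42.0, epsilon=0.1))

In exact arithmetic this mechanism satisfies ε-differential privacy, but only a sparse, input-dependent subset of floating-point values can actually be produced, so the bit pattern of the output can reveal more about the true value than the proof allows; careful implementations therefore snap the result to a coarser grid or sample noise from discrete distributions instead.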
Finally, many privacy mechanisms
rely on a random number generator. A
provably secure implementation of a
privacy-preserving algorithm must be tailored to the quality of the random bits it uses. 8
A Generic Recipe
Privacy definitions that are convex,
closed under post-processing, and require protections for all outputs of a
mechanism M all have a similar format and can be written in terms of linear constraints, 28 as in the following
generic template:
Definition 4 (a generic privacy definition). Let D1, D2, … be the collection of possible input datasets. For some fixed constants a1, a2, …, an algorithm M must satisfy linear constraints of the following form for every possible output ω the algorithm can produce:

a1 P(M(D1) = ω) + a2 P(M(D2) = ω) + … ≤ 0
To evaluate a proposed privacy definition, a good sanity check against current best practices is thus to verify whether or not it can be expressed as linear constraints on the probabilistic behavior of the algorithm M, as in Definition 4; for example, k-anonymity does not fit this template, 28 but with ε-differential privacy there is a linear constraint P(M(Dj1) = ω) − e^ε P(M(Dj2) = ω) ≤ 0 for every pair of datasets Dj1, Dj2 that differ on the presence of one record.
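As a concrete, entirely hypothetical illustration of this sanity check, the following Python sketch enumerates the ε-differential-privacy constraints of the template for randomized response on a single-bit database (the mechanism and parameter values are illustrative choices, not taken from the text) and verifies that every linear constraint holds:

import math

EPSILON = math.log(3)  # randomized response with p = 3/4 satisfies epsilon = ln 3

def output_distribution(record_bit):
    # P(M(D) = omega) for omega in {0, 1}, where D holds the single bit record_bit.
    p = 0.75  # report the true bit with probability p, flip it otherwise
    return {record_bit: p, 1 - record_bit: 1.0 - p}

datasets = [0, 1]  # two neighboring one-record databases
for d1 in datasets:
    for d2 in datasets:
        if d1 == d2:
            continue
        for omega in (0, 1):
            # Linear constraint from the template:
            #   P(M(D1) = omega) - e^epsilon * P(M(D2) = omega) <= 0
            lhs = (output_distribution(d1)[omega]
                   - math.exp(EPSILON) * output_distribution(d2)[omega])
            assert lhs <= 1e-12, (d1, d2, omega, lhs)

print("Every linear constraint holds; the mechanism fits the template.")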
We next discuss some semantic guarantees achievable through this template.
Good and bad disclosures. Even
when published data allows an analyst
to make better inferences about Bob,
Bob’s privacy has not necessarily been
violated by this data. Consider Bob’s
nosy but uninformed neighbor Charley, who knows Bob is a life-long chain-smoker and thinks cancer is unrelated
to smoking. After seeing data from a
smoking study, Charley learns smoking causes cancer and now believes
that Bob is very likely to suffer from
it. This inference may be considered
benign (or unavoidable) because it is
based on a fact of nature.
Now consider a more nuanced situation where Bob participates in the
aforementioned smoking study, the
data is processed by M, and the result
ω (which shows that smoking causes
cancer) is published. Charley’s beliefs about Bob can change as a result of the combination of two factors: his learning that smoking causes cancer, and the fact that Bob’s record may have affected the output of the algorithm. The latter factor is what poses the privacy risk. There are two approaches to isolating and measuring whether Charley’s change in belief is due to Bob’s record rather than to his knowledge of some law of nature: “counterfactuals” 12, 25, 33 and “simulatability.” 2, 15, 29
Privacy via counterfactuals. The first approach, 12, 25, 33 based on counterfactual reasoning, is rooted in the idea that
learning the true distribution underlying the private database is acceptable,
but learning how a specific individual’s
data deviates from this distribution is a
privacy breach.
Consider pairs of alternatives
(such as “Bob has cancer” and “Bob is
healthy”). If the true data-generating
distribution θ is known, we could use
it to understand how each alternative
affects the output of M (taking into ac-
count uncertainty about the data) by
considering the probabilities Pθ(M out-
puts ω | Bob has cancer) and Pθ(M out-
puts ω | Bob is healthy). Their ratio is
known as the “odds ratio.” It is the mul-
tiplicative factor that converts the ini-
tial odds of Bob having cancer (before
seeing ω) into the updated odds (after
seeing ω). When the odds ratio is close
to 1, there is little change in the odds,
and Bob’s privacy is protected.
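A toy calculation (with made-up numbers, purely for illustration) shows how the odds ratio converts Charley’s prior odds into posterior odds:

# Hypothetical probabilities under a known data-generating distribution theta.
p_omega_given_cancer = 0.052   # P_theta(M outputs omega | Bob has cancer)
p_omega_given_healthy = 0.050  # P_theta(M outputs omega | Bob is healthy)

odds_ratio = p_omega_given_cancer / p_omega_given_healthy

prior_odds = 0.25                         # Charley's initial odds that Bob has cancer (1 to 4)
posterior_odds = odds_ratio * prior_odds  # odds after seeing omega

print(f"odds ratio = {odds_ratio:.3f}")          # 1.040, close to 1
print(f"posterior odds = {posterior_odds:.3f}")  # 0.260, barely changed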
Why does this work? If the rea-
soning is done using the true distri-
bution, then we have bypassed the
change in beliefs due to learning
about laws of nature. After seeing ω,
the change in Charley’s beliefs depends only on the extent to which ω is influenced by Bob (that is, the fact that it was computed using Bob’s record).
What if the true distribution is unknown? To handle this scenario, the
data curator can specify a set Ξ of plausible distributions and ensure reasoning with any of them is harmless; the
corresponding odds ratios are all close
to 1. A counterfactual-based privacy definition would thus enforce constraints
like Pθ(M outputs ω | Bob has cancer) ≤ e^ε Pθ(M outputs ω | Bob is healthy) for all
possible ω, for various pairs of alternatives and distributions θ. When written
mathematically, these conditions turn
into linear constraints, as in the generic
template (Definition 4).
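The following sketch (with an invented set Ξ and invented probabilities, not values from the text) shows the shape of such a check: for every plausible distribution θ and every output ω, it verifies the constraint in both directions of the pair of alternatives:

import math

EPSILON = 0.1

# Hypothetical set Xi of plausible distributions. For each output omega, an entry
# lists (P_theta(omega | Bob has cancer), P_theta(omega | Bob is healthy)).
XI = [
    {"omega1": (0.30, 0.29), "omega2": (0.70, 0.71)},
    {"omega1": (0.10, 0.11), "omega2": (0.90, 0.89)},
]

def satisfies_counterfactual_constraints(xi, epsilon):
    bound = math.exp(epsilon)
    for theta in xi:
        for omega, (p_cancer, p_healthy) in theta.items():
            # Check both directions of the pair of alternatives.
            if p_cancer > bound * p_healthy or p_healthy > bound * p_cancer:
                return False
    return True

print(satisfies_counterfactual_constraints(XI, EPSILON))  # True for these numbers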
Privacy via simulatability. The second approach, 2, 15, 29 based on simulatability, is motivated by the idea that
learning statistics about a large population of individuals is acceptable, but
learning how an individual differs from
the population is a privacy breach.
The main idea is to compare the behavior of an algorithm M with input D to that of another algorithm S, often called a “simulator,” with a safer input D′;
for example, D′ could be a dataset that
is obtained by removing Bob’s record
from D. If the distribution of outputs of
M and S are similar, then an attacker
is essentially clueless about whether ω
was produced by running M on D or by
running S on D′. Now S does not know
anything about Bob’s record except
what it can predict from the rest of the
records in D′ (such as a link between
smoking and cancer). Bob’s record is
thus protected. Similarly, Alice’s privacy can be tested by considering different alterations where Alice’s record
is removed instead of Bob’s record. If