statistical databases are designed to
teach can, sometimes indirectly, cause
damage to an individual, even if this
individual is not in the database.
In practice, statistical databases
are (typically) created to provide some
anticipated social gain; they teach us
something we could not (easily) learn
without the database. Together with
the attack against Turing, and the
fact that he did not have to be a member of the database for the attack to
work, this suggests a new privacy goal:
Minimize the increased risk to an individual incurred by joining (or leaving)
the database. That is, we move from
comparing an adversary’s prior and
posterior views of an individual to comparing the risk to an individual when
included in, versus when not included
in, the database. This makes sense.
A privacy guarantee that limits risk
incurred by joining encourages participation in the dataset, increasing social
utility. This is the starting point on our
path to differential privacy.
Differential Privacy
Differential privacy will ensure that
the ability of an adversary to inflict
harm (or good, for that matter)—of
any sort, to any set of people—should
be essentially the same, independent
of whether any individual opts in to, or
opts out of, the dataset. We will do this
indirectly, simultaneously addressing
all possible forms of harm and good,
by focusing on the probability of any
given output of a privacy mechanism
and how this probability can change
with the addition or deletion of any
row. Thus, we will concentrate on pairs
of databases (D, D′) differing only in
one row, meaning one is a subset of
the other and the larger database contains just one additional row. Finally,
to handle worst-case pairs of databases, our probabilities will be over the
random choices made by the privacy
mechanism.
Definition 1. A randomized function K gives ε-differential privacy if for all datasets D and D′ differing on at most one row, and all S ⊆ Range(K),

Pr[K(D) ∈ S] ≤ exp(ε) × Pr[K(D′) ∈ S],   (1)

where the probability space in each case is over the coin flips of K.
The multiplicative nature of the guarantee implies that an output whose
probability is zero on a given database
must also have probability zero on any
neighboring database, and hence, by
repeated application of the definition, on any other database. Thus,
Definition 1 trivially rules out the
subsample-and-release paradigm discussed earlier: For an individual x not in the
dataset, the probability that x’s data
is sampled and released is obviously
zero; the multiplicative nature of the
guarantee ensures that the same is
true for an individual whose data is in
the dataset.
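To spell out the multiplicative step used here, take S to be the event that x's data appears in the output, so that Pr[K(D′) ∈ S] = 0 for the database D′ not containing x's row. For the neighboring database D that does contain it, inequality (1) gives

Pr[K(D) ∈ S] ≤ exp(ε) × Pr[K(D′) ∈ S] = exp(ε) × 0 = 0,

and chaining the same inequality along any sequence of databases, each differing from the next in one row, carries the conclusion to every database.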
Any mechanism satisfying this definition addresses all concerns that any
participant might have about the leakage of his or her personal information,
regardless of any auxiliary information
known to an adversary: Even if the participant removed his or her data from
the dataset, no outputs (and thus consequences of outputs) would become
significantly more or less likely. For
example, if the database were to be consulted by an insurance provider before
deciding whether or not to insure a
given individual, then the presence or
absence of any individual’s data in the
database will not significantly affect
his or her chance of receiving coverage.
Definition 1 extends naturally to
group privacy. Repeated application
of the definition bounds the ratios of
probabilities of outputs when a collection C of participants opts in or opts
out, by a factor of exp(ε|C|). Of course, the
point of the statistical database is to
disclose aggregate information about
large groups (while simultaneously
protecting individuals), so we should
expect privacy bounds to disintegrate
with increasing group size.
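One way to spell out this repeated application: let D and D′ differ in the rows of a group C of size k = |C|, and interpolate a chain D = D_0, D_1, …, D_k = D′ in which consecutive databases differ in one row. Applying (1) to each adjacent pair and multiplying the resulting inequalities yields

Pr[K(D) ∈ S] ≤ exp(ε) × Pr[K(D_1) ∈ S] ≤ … ≤ exp(ε)^k × Pr[K(D′) ∈ S] = exp(ε|C|) × Pr[K(D′) ∈ S],

which is exactly the group-privacy bound just stated.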
The parameter ε is public, and its selection is a social question. We tend to think of ε as, say, 0.01, 0.1, or in some cases, ln 2 or ln 3.
Sometimes, for example, in the census, an individual’s participation is
known, so hiding presence or absence
makes no sense; instead we wish to
hide the values in an individual’s row.
Thus, we can (and sometimes do)
extend “differing in at most one row”
to mean having symmetric difference
at most 1 to capture both possibilities.
However, we will continue to use the
original definition.
Returning to randomized response,
we see that it yields ε-differential privacy for a value of ε that depends on
the universe from which the rows are
chosen and the probability with which
a random, rather than non-random,
value is contributed by the respondent. As an example, suppose each
row consists of a single bit, and that
the respondent’s instructions are to
first flip an unbiased coin to determine
whether he or she will answer randomly or truthfully. If heads (respond
randomly), then the respondent is to
flip a second unbiased coin and report
the outcome; if tails, the respondent
answers truthfully. Fix b ∈ {0, 1}. If the
true value of the input is b, then b is output with probability 3/4. On the other
hand, if the true value of the input is
1 − b, then b is output with probability
1/4. The ratio is 3, yielding (ln 3)-differential privacy.
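As a concrete illustration, the following is a minimal Python sketch of this single-bit procedure (the function name randomized_response_bit is ours, not from the text), together with an empirical check of the 3/4-versus-1/4 ratio.

    import random

    def randomized_response_bit(true_bit):
        """Report one bit using the coin-flipping scheme described above."""
        if random.random() < 0.5:          # first coin: heads, answer randomly
            return random.randint(0, 1)    # second coin: report its outcome
        return true_bit                    # first coin: tails, answer truthfully

    # Pr[output = 1 | true = 1] is about 3/4, Pr[output = 1 | true = 0] about 1/4,
    # so the empirical ratio is close to 3, matching the (ln 3) guarantee.
    trials = 100_000
    p1 = sum(randomized_response_bit(1) for _ in range(trials)) / trials
    p0 = sum(randomized_response_bit(0) for _ in range(trials)) / trials
    print(p1, p0, p1 / p0)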
Suppose n respondents each employ
randomized response independently,
but using coins of known, fixed, bias.
Then, given the randomized data, by
the properties of the binomial distribution the analyst can approximate
the true answer to the question “How
many respondents have value b?” to
within an expected error on the order
of √n. As we will see, it is possible
to do much better—obtaining constant
expected error, independent of n.
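To see where the √n figure comes from, here is a short sketch of one natural estimator (the names randomized_response_bit and estimate_count are ours): if k of the n true bits equal 1, the expected number of reported 1s is n/4 + k/2, so 2 × (number of reported 1s) − n/2 is an unbiased estimate of k, and its standard deviation is on the order of √n.

    import random

    def randomized_response_bit(true_bit):
        # Same single-bit scheme as in the previous sketch.
        if random.random() < 0.5:
            return random.randint(0, 1)
        return true_bit

    def estimate_count(reports):
        """Unbiased estimate of the true number of 1s from randomized reports:
        E[#reported 1s] = n/4 + k/2, so k is estimated by 2*(#reported 1s) - n/2."""
        n = len(reports)
        return 2 * sum(reports) - n / 2

    # Simulate n respondents, half of whom truly hold the value 1.
    n = 10_000
    true_bits = [1] * (n // 2) + [0] * (n // 2)
    reports = [randomized_response_bit(b) for b in true_bits]
    print(abs(estimate_count(reports) - sum(true_bits)))   # typically on the order of sqrt(n), about 100 here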
Generalizing in a different direction, suppose each row now has two
bits, each one randomized independently, as described earlier. While each
bit remains (ln 3)-differentially private,
their logical-AND enjoys less privacy.
That is, consider a privacy mechanism
in which each bit is protected by this
exact method of randomized response,
and consider the query: “What is the
logical-AND of the bits in the row of
respondent i (after randomization)?”
If we consider the two extremes, one
in which respondent i has data 11
and the other in which respondent
i has data 00, we see that in the first
case the probability of output 1 is 9/16,
while in the second case the probability is 1/16. Thus, this mechanism is at
best (ln 9)-differentially private, not
ln 3. Again, it is possible to do much
better, even while releasing the entire
4-element histogram, also known as a
contingency table, with only constant
expected error in each cell.
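The 9/16 and 1/16 figures follow from the independence of the two randomizations: each randomized bit equals 1 with probability 3/4 when its true value is 1, and with probability 1/4 when it is 0. A small sketch (function names ours) checks the arithmetic:

    # Probability that a randomized bit is reported as 1, given its true value,
    # under the coin-flipping scheme described above.
    def p_report_one(true_bit):
        return 0.75 if true_bit == 1 else 0.25

    # Probability that the logical-AND of respondent i's two randomized bits is 1;
    # the two bits are randomized independently.
    def p_and_is_one(bit_a, bit_b):
        return p_report_one(bit_a) * p_report_one(bit_b)

    print(p_and_is_one(1, 1))                        # 0.5625 = 9/16 (data 11)
    print(p_and_is_one(0, 0))                        # 0.0625 = 1/16 (data 00)
    print(p_and_is_one(1, 1) / p_and_is_one(0, 0))   # 9.0, so at best (ln 9)-differential privacy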