S can approximately simulate the behavior of M no matter what the true data D is and no matter what alteration was performed, then every individual record is protected.

Privacy definitions based on simulatability are generally more complex than those based on counterfactuals. To check whether M satisfies such a definition, it is often necessary to find the appropriate simulator S. However, in some cases, the privacy definitions can also be expressed using linear constraints, as in the generic privacy definition template.
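For intuition, ε-differential privacy itself can be written as a family of such linear constraints (a standard rendering, shown only for illustration; it is not the exact template of Definition 4): for every output ω and every pair of datasets D1 and D2 that differ in one record,

    P[M(D1) = ω] ≤ e^ε · P[M(D2) = ω].

Since e^ε is a constant, each constraint is linear in the output probabilities P[M(D) = ω], and checking it does not require constructing a simulator S.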
Counterfactuals vs. simulatability. The differences between counterfactual and simulatability approaches depend on the nature of the data that must be protected. When the data records are independent of each other, properties of the data-generating distribution and properties of a population are essentially the same (due to the law of large numbers), in which case both approaches provide similar protection.
Data correlations. A difference arises when there is correlation between individuals. First, we consider a scenario in which counterfactuals would be more appropriate. Suppose a database contains records about Bob and his relatives. Even if Bob’s record is removed, Bob’s susceptibility to various diseases can be predicted from the rest of the data because it contains his family’s medical history. The general goal of privacy definitions based on simulatability is not to hide this inference but to hide how Bob’s actual record differs from this prediction. On the other hand, if we include probabilistic models of how diseases are passed through genetics, then privacy definitions based on counterfactuals will try to prevent predictions about Bob and his family. Intuitively, this happens because the actual family medical history is not a property of the data-generating distribution but of a sample from that distribution. Since the family medical history is correlated with Bob’s record, it would allow better predictions about how Bob deviates from the data-generating distribution; hence, it must be protected as well.
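To make the correlation argument concrete, the following toy Python sketch (the model and all probabilities are made up for illustration and are not from the article) shows how a parent’s record sharpens predictions about Bob through a shared genetic risk factor, even when Bob’s own record is absent:

```python
# Toy model with made-up probabilities: a parent's medical record sharpens
# predictions about Bob via a shared genetic risk factor, even though Bob's
# own record is not used at all.
p_gene = 0.05      # P(parent carries the risk gene)
p_pass = 0.50      # P(Bob carries the gene | parent carries it)
p_spont = 0.01     # P(Bob carries the gene | parent does not)
p_dis_g = 0.60     # P(disease | gene)
p_dis_ng = 0.05    # P(disease | no gene)

def p_disease(p_has_gene):
    return p_has_gene * p_dis_g + (1 - p_has_gene) * p_dis_ng

# Prediction about Bob without the family history.
p_bob_gene = p_gene * p_pass + (1 - p_gene) * p_spont
prior = p_disease(p_bob_gene)

# Update on the parent's gene after seeing the parent's disease status,
# then propagate to Bob.
p_gene_given_dis = p_gene * p_dis_g / p_disease(p_gene)
p_bob_gene_upd = p_gene_given_dis * p_pass + (1 - p_gene_given_dis) * p_spont
posterior = p_disease(p_bob_gene_upd)

print(f"P(Bob diseased), no family history:     {prior:.3f}")      # ~0.069
print(f"P(Bob diseased), parent known diseased: {posterior:.3f}")  # ~0.160
```

With these particular numbers the prediction about Bob rises from about 0.07 to about 0.16; this is exactly the kind of inference a counterfactual definition equipped with a genetic model would try to prevent.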
Next, we examine a situation where simulatability-based privacy definitions are more appropriate. Consider a social network where many profiles of individuals are public. Private information about individuals is often predictable directly from the public profiles of their friends and contacts.37 Even if Bob’s profile is private, it is easy to collect information that is correlated with Bob. Here, privacy definitions based on simulatability are applicable, allowing data curators to process the social network data with algorithms M that create outputs from which it is difficult to tell if Bob’s record was used in the computation.
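As a minimal sketch of what such an algorithm M could look like (a toy example using the familiar Laplace mechanism; the attribute values and the choice of ε are hypothetical), a noisy count has nearly the same output distribution whether or not Bob’s record is included:

```python
import numpy as np

# Toy sketch of such an algorithm M: a count query answered with the
# standard Laplace mechanism (sensitivity 1). Values and epsilon are
# hypothetical.
rng = np.random.default_rng(0)
eps = 0.5

def M(records):
    true_count = sum(records)               # e.g., number of users with some attribute
    return true_count + rng.laplace(scale=1.0 / eps)

network_with_bob = [1, 0, 1, 1, 0, 1]        # toy attribute column; Bob is the last entry
network_without_bob = network_with_bob[:-1]  # altered data with Bob's record removed

# The true counts differ by at most 1, so the two output densities differ
# by a factor of at most e^eps at every point -- from the output alone it
# is hard to tell which dataset was used.
print(M(network_with_bob), M(network_without_bob))
```

Because the counts differ by at most one and the noise scale is 1/ε, the two output distributions differ by a factor of at most e^ε everywhere, which is the sense in which the output hides whether Bob’s record was used.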
Data constraints. One difficulty in designing privacy definitions is accounting for public knowledge of constraints the input database must satisfy. Constraints may correlate the values of different records, arising due to, say, functional dependencies across attributes or prior exact releases of histograms. Correlations arising from constraints provide inference channels attackers could use to learn sensitive information. A privacy definition must thus account for them; for example, while Census data records must be treated confidentially, certain coarse population statistics must, by law, be released exactly in order to determine the number of Congressional Representatives for each state. More generally, if a histogram H of the data has been released exactly, how can a data curator choose a privacy definition, and hence constraints on M, that account for the information in the histogram so any subsequent data release via M still ensures privacy? Complete solutions to this problem remain open but appear to be easier for approaches based on counterfactuals if we use data-generating distributions that are conditioned on the histogram, or P(D|H).25 For approaches based on simulatability, there is more of a challenge, since data-alteration techniques consistent with previously released information must be developed; recall, these techniques provide the guarantee that an attacker would not be able to reliably determine whether the original or the altered dataset was used in the computation. It is important to note, too, that constraints on the input data, especially those arising from prior releases, can be exploited for better utility.
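As a hypothetical illustration of such a constraint-consistent alteration (a sketch under simplifying assumptions, not a technique from the article): if a histogram H over an attribute such as age group has been released exactly, one simple alteration that remains consistent with H is to replace everything about a target record except its histogram attribute.

```python
import random

# Hypothetical alteration that stays consistent with an exactly released
# histogram H over "age_group": replace everything about the target record
# except its age_group with the attributes of another record in the same bin.
def alter_within_bin(records, target_idx):
    target = records[target_idx]
    same_bin = [r for i, r in enumerate(records)
                if i != target_idx and r["age_group"] == target["age_group"]]
    donor = random.choice(same_bin)             # assumes the target's bin is not a singleton
    altered = dict(donor)
    altered["age_group"] = target["age_group"]  # histogram H is unchanged
    return records[:target_idx] + [altered] + records[target_idx + 1:]

data = [
    {"age_group": "30-39", "disease": "flu"},
    {"age_group": "30-39", "disease": "none"},
    {"age_group": "40-49", "disease": "asthma"},
]
print(alter_within_bin(data, target_idx=0))
```

Both the original and the altered dataset are consistent with the released H, which is the property the simulatability guarantee needs; developing such alteration techniques in general remains, as noted above, an open problem.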
Interpretations of differential privacy. These two approaches for defining semantics for privacy definitions also provide two ways of interpreting ε-differential privacy. The simulatability argument shows that an algorithm M satisfying ε-differential privacy provides the following protection: an attacker cannot detect whether M was run on the original data or on altered data from which any given record was removed.2,14 This is true no matter how knowledgeable the attacker is, as long as the data alteration is consistent with what is known about the data; if not, additional leakage can occur, as explained in the earlier discussion on composition with other mechanisms. From a different perspective, the counterfactual argument shows that an algorithm M satisfying ε-differential privacy prevents an attacker from learning how an individual differs from the data-generating distribution precisely when all records are independent.25

Example: Blowfish
We illustrate this discussion with Blowfish,19 a new class of privacy definitions that follows the generic privacy template in Definition 4. Like differential privacy, Blowfish definitions satisfy a number of desirable properties we outlined earlier, including Kerckhoffs’s principle, self-composition, convexity, and closure under post-processing. The privacy goals of Blowfish definitions have both a counterfactual and a simulatability interpretation. In addition to satisfying these properties, Blowfish definitions improve on differential privacy by including a generalized and formal specification of what properties of an individual in the data are kept private and by accounting for external knowledge about constraints in the data. Blowfish thus captures part of the privacy design space. In the rest of this section, we describe how data owners can use Blowfish to customize privacy protections for their applications.

Blowfish definitions take two parameters: a privacy parameter ε (similar to differential privacy) and a policy P = (T, G, I_Q) that allows data curators to customize privacy guarantees. Here, T is the set of possible record values, Q is a set of publicly known constraints on the data, and I_Q is the set of all possible datasets consistent with Q. Specifying I_Q allows a data curator to create