Likewise, privacy-mechanism designers should always assume attackers are smarter than they are. Just because the designer of a privacy mechanism cannot deduce sensitive information from the output of a piece of software does not mean an adversary will also fail.
A well-engineered privacy definition
will overcome these disadvantages,
protecting sensitive information from
clever attackers who know how M operates. We explain how in subsequent
sections.
Post-processing. A privacy definition determines the mechanisms that
data curators can trust to not leak sensitive information. Let M be one such
mechanism, and suppose A is some
algorithm that can be applied to the
output of M; for example, suppose M
creates synthetic data from its inputs,
and A builds a decision tree. Let the notation A ∘ M denote the composite algorithm that first applies M to the sensitive data and then runs A on the sanitized output of M.
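To make the notation concrete, here is a minimal Python sketch (our illustration, not from the article; the function and placeholder names are assumptions):

```python
def compose(A, M):
    """Return the composite algorithm A o M: first sanitize the sensitive
    data with the privacy mechanism M, then run the analysis A on M's
    sanitized output only."""
    return lambda sensitive_data: A(M(sensitive_data))

# Hypothetical usage mirroring the article's example: M releases synthetic
# data, A fits a model to it; A never touches the sensitive data directly.
# pipeline = compose(build_decision_tree, generate_synthetic_data)
# model = pipeline(sensitive_database)
```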
If M is trusted, should this composite algorithm A ∘ M also be trusted? Intuitively, the answer is yes. It would be very strange if a data curator released privacy-preserving synthetic data but then claimed that building statistical models from this data is a violation of privacy.
A privacy definition is closed under post-processing if A ∘ M satisfies the constraints defining the privacy definition whenever M does. Differential privacy11 satisfies this property, but k-anonymity does not.23 Closure under post-processing has two important consequences: First, it ensures compatibility between a privacy definition and Kerckhoffs’s principle; for example, some algorithms that satisfy k-anonymity are susceptible to a so-called minimality attack.13,36 For each
such k-anonymity mechanism M, it
is possible to craft a post-processing
algorithm A that takes the output of
M, undoes some of its data transformations, and outputs a new dataset
having records with possibly unique
quasi-identifier values that are vulnerable to linkage attacks with external
data. That is, the composite algorithm A ∘ M does not satisfy the same condition as M (namely, k-anonymity) and often reveals sensitive records.
By contrast, suppose an ε-differentially private algorithm M is applied to the data; because differential privacy is closed under post-processing, any composite algorithm A ∘ M still satisfies ε-differential privacy. The definition requires that, for every possible output ω and for every pair of databases D1, D2 that differ on the addition or removal of a single record, P(M(D1) = ω) ≤ e^ε · P(M(D2) = ω).
Intuitively, ε-differential privacy
guarantees that adding or removing a
record from the data will have little effect on the output of a privacy mechanism M. For small ε, it means M will
probably produce the same sanitized
output regardless of whether or not
Bob’s record is in the data.
How should a data curator choose ε?
Consider a highly targeted query about
an individual (such as asking if Bob’s
record is in the data). For ε-differential privacy, the most revealing privacy mechanism answers truthfully with probability e^ε/(1 + e^ε) and falsely with probability 1/(1 + e^ε).24 When ε is close
to 0, both these probabilities are close
to ½, and little information is provided;
the mechanism is almost as likely to
lie as respond truthfully; for example,
when ε = 0.1, the true answer probability is ≈ 0.525, and when ε = 0.01, the
probability is ≈ 0.502. We recommend
choosing ε based on how close the curator wants this value to be to ½.
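As a concrete illustration (ours, not from the article; the function name is an assumption), the following sketch implements this most revealing mechanism for a single yes/no query and prints the truthful-answer probability for the ε values above:

```python
import math
import random

def answer_membership_query(true_answer: bool, epsilon: float) -> bool:
    """Answer a yes/no query (such as "is Bob's record in the data?")
    truthfully with probability e^eps / (1 + e^eps), falsely otherwise."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return true_answer if random.random() < p_truth else not true_answer

for eps in (0.1, 0.01):
    p_truth = math.exp(eps) / (1.0 + math.exp(eps))
    print(f"epsilon = {eps}: truthful-answer probability ≈ {p_truth:.3f}")
# epsilon = 0.1:  truthful-answer probability ≈ 0.525
# epsilon = 0.01: truthful-answer probability ≈ 0.502
```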
The Laplace mechanism is a popular
mechanism for ε-differential privacy.
Let f be a function that computes a vector of query answers on the data. To each query answer, the Laplace mechanism adds an independent Laplace random variable with mean 0 and standard deviation √2 · S(f)/ε, where S(f) is the global sensitivity of f: the largest possible change in f due to the addition of one record, that is, the maximum of ||f(D1) − f(D2)||1 over pairs of databases D1, D2 that differ in one record. Intuitively, the noise masks the influence of any single record on the result of f.
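A minimal sketch of the Laplace mechanism follows (ours, not from the article; it assumes NumPy and a simple counting query whose global sensitivity is 1):

```python
import numpy as np

def laplace_mechanism(query_answers, sensitivity, epsilon, rng=None):
    """Add independent Laplace(0, S(f)/epsilon) noise to each query answer.
    A scale of S(f)/epsilon corresponds to a standard deviation of
    sqrt(2) * S(f) / epsilon."""
    if rng is None:
        rng = np.random.default_rng()
    scale = sensitivity / epsilon
    answers = np.asarray(query_answers, dtype=float)
    return answers + rng.laplace(loc=0.0, scale=scale, size=answers.shape)

# Hypothetical usage: a counting query ("how many records have disease = Flu?")
# changes by at most 1 when one record is added, so its global sensitivity is 1.
true_count = 42
noisy_count = laplace_mechanism([true_count], sensitivity=1.0, epsilon=0.1)[0]
```

Smaller ε means a larger noise scale S(f)/ε, trading accuracy for stronger protection. Now consider: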
Definition 2 (k-anonymity34,35). Given a set Q of attributes, known as the quasi-identifier, a table is k-anonymous if every record in it has the same quasi-identifier values as at least k − 1 other records.
An algorithm satisfies k-anonymity if it
outputs only k-anonymous tables.
K-anonymity defends against one
type of attack called a “linkage attack”—
joining an external dataset that associates an identity (such as name) with the
quasi-identifier (such as ZIP code and
age) to a k-anonymous table containing
this publicly available quasi-identifier.
The goal of k-anonymity is to prevent the matching of an identity to a single tuple in the k-anonymous table; clearly, there will always be at least k candidates in the join result. K-anonymity mechanisms usually operate
by coarsening attributes (such as dropping digits from ZIP codes and changing
ages to age ranges); see Figure 1 for two
examples of k-anonymous tables.
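As a concrete illustration of Definition 2 (ours, not from the article; the function name and the plain-dict table representation are assumptions), the following sketch checks whether a table is k-anonymous for a given quasi-identifier and verifies the table in Figure 1(b):

```python
from collections import Counter

def is_k_anonymous(table, quasi_identifier, k):
    """Return True if every record shares its quasi-identifier values
    with at least k - 1 other records in the table.

    table: list of dicts mapping attribute names to values.
    quasi_identifier: list of attribute names (the set Q)."""
    groups = Counter(tuple(row[a] for a in quasi_identifier) for row in table)
    return all(count >= k for count in groups.values())

# Check against a table shaped like Figure 1(b).
table_b = [
    {"Zip Code": "130**", "Age": "< 40", "Disease": "Cold"},
    {"Zip Code": "130**", "Age": "< 40", "Disease": "Stroke"},
    {"Zip Code": "130**", "Age": "< 40", "Disease": "Rash"},
    {"Zip Code": "1485*", "Age": "≥ 40", "Disease": "Cancer"},
    {"Zip Code": "1485*", "Age": "≥ 40", "Disease": "Flu"},
    {"Zip Code": "1485*", "Age": "≥ 40", "Disease": "Cancer"},
]
assert is_k_anonymous(table_b, ["Zip Code", "Age"], k=3)
```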
Security Without Obscurity
The process of sanitizing sensitive data
through a privacy mechanism M must
follow Kerckhoffs’s principle21 and ensure privacy even against adversaries
who might know the details of M, except for the specific random bits it may
have used. Better yet, the mechanism M should be revealed along with the sanitized output.
The reasons for making the mechanism M publicly known are twofold:
First, history shows “security through
obscurity” is unreliable in many applications; and, second, the output of M
must be useful. This sanitized output
often takes the form of a dataset or statistical model and could be intended to
support scientific research. In such a
case, statistical validity is an important
concern, and statistically valid conclusions can be drawn only when scientists
know precisely what transformations
were applied to the base sensitive data.
Figure 1. Examples of k-anonymity: (a) 4-anonymous table; (b) 3-anonymous table.

(a)
Zip Code   Age      Disease
130**      25–30    None
130**      25–30    Stroke
130**      25–30    Flu
130**      25–30    Cancer
902**      60–70    Flu
902**      60–70    Stroke
902**      60–70    Flu
902**      60–70    Cancer

(b)
Zip Code   Age      Disease
130**      < 40     Cold
130**      < 40     Stroke
130**      < 40     Rash
1485*      ≥ 40     Cancer
1485*      ≥ 40     Flu
1485*      ≥ 40     Cancer