wild noise magnitudes, provided the other responses have noise magnitude o(√n).
These are not just interesting mathematical exercises. We have been
focusing on interactive privacy mechanisms, distinguished by the involvement of the curator in answering each
query. In the noninteractive setting the
curator publishes some information
of arbitrary form, and the data is not
used further. Research statisticians
like to “look at the data,” and we have
frequently been asked for a method
of generating a “noisy table” that will
permit highly accurate answers to
be derived for computations that are
not specified at the outset. The noise
bounds say this is impossible: No such
table can safely provide very accurate
answers to too many weighted subset
sum questions; otherwise the table
could be used in a simulation of the
interactive mechanism, and an attack
could be mounted against the table.
Thus, even if the analyst requires responses to only a small number of unspecified queries, the fact that the
table can be exploited to gain answers
to other queries is problematic.
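To see concretely how such a table could be attacked, consider the following sketch (hypothetical code, not from the cited work): it asks a published noisy table random subset-sum questions and reconstructs nearly all of a secret bit vector by least-squares rounding, a cruder stand-in for the linear-programming attacks in the literature, whenever the per-answer noise is well below √n.

# Hypothetical illustration: reconstructing a secret bit vector from a
# "noisy table" that answers random subset-sum queries with small noise.
# Uses least-squares rounding rather than the linear program from the
# literature; all parameters here are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)

n = 200                # number of database rows (secret bits)
m = 4 * n              # number of subset-sum queries derived from the table
noise_scale = 2        # per-answer noise magnitude, far below sqrt(n)

secret = rng.integers(0, 2, size=n)          # the private column, one bit per row

# Each query is a random subset of rows; the "table" returns the subset sum
# perturbed by bounded noise.
queries = rng.integers(0, 2, size=(m, n))
answers = queries @ secret + rng.uniform(-noise_scale, noise_scale, size=m)

# Attack: find a real-valued x minimizing ||queries @ x - answers||, then round.
x, *_ = np.linalg.lstsq(queries, answers, rcond=None)
reconstruction = (x > 0.5).astype(int)

errors = int(np.sum(reconstruction != secret))
print(f"rows reconstructed incorrectly: {errors} out of {n}")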
In the case of “Internet scale” datasets, obtaining responses to, say,
n ≥ 10⁸ queries is infeasible. What
happens if the curator permits only
a sublinear number of questions?
This inquiry led to the first algorithmic results in differential privacy, in
which it was shown how to maintain
privacy against a sublinear number of
counting queries, that is, queries of the
form “How many rows in the database
satisfy property P?” by adding noise of
order o(√n)—less than the sampling error—to each answer.12 The cumbersome privacy guarantee, which focused
on the question of what an adversary
can learn about a row in the database,
is now known to imply a natural and
still very powerful relaxation of differential privacy, defined here.
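By way of illustration only (the interface and noise distribution below are assumptions, not the mechanism of the cited work), a curator for counting queries might be sketched as follows: it answers a sublinear number of queries, perturbing each true count with noise of magnitude o(√n), below the sampling error of order √n.

# Minimal sketch (assumed interface, not the mechanism of the cited work):
# an interactive curator that answers a bounded, sublinear number of counting
# queries, perturbing each true count with noise of magnitude o(sqrt(n)).
import math
import random

class NoisyCountingCurator:
    def __init__(self, rows, max_queries):
        self.rows = rows
        self.remaining = max_queries                 # sublinear in len(rows)
        # Noise scale n^0.4 is o(sqrt(n)), below the ~sqrt(n) sampling error.
        self.noise_scale = len(rows) ** 0.4

    def count(self, predicate):
        """Answer 'how many rows satisfy predicate?' with additive noise."""
        if self.remaining <= 0:
            raise RuntimeError("query budget exhausted")
        self.remaining -= 1
        true_count = sum(1 for r in self.rows if predicate(r))
        return true_count + random.gauss(0, self.noise_scale)

# Example: counting rows with age >= 40 in a toy database of 100,000 rows.
db = [{"age": random.randint(18, 90)} for _ in range(100_000)]
curator = NoisyCountingCurator(db, max_queries=int(math.sqrt(len(db))))
print(round(curator.count(lambda r: r["age"] >= 40)))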
“What” is hard
Newspaper horror stories about
“anonymized” and “de-identified”
data typically refer to noninteractive
approaches in which certain kinds
of information in each data record
have been suppressed or altered. A
famous example is AOL’s release of
a set of “anonymized” search query
logs. People search for many “obviously” disclosive things, such as their
full names (vanity searches), their own
social security numbers (to see if their
numbers are publicly available on the
Web, possibly with a goal of assessing
the threat of identity theft), and even
the combination of mother’s maiden
name and social security number. AOL
carefully redacted such obviously disclosive “personally identifiable information,” and each user id was replaced
by a random string. However, search
histories can be very idiosyncratic, and
a New York Times reporter correctly
traced such an “anonymized” search
history to a specific resident of Georgia.
In a linkage attack, released data
are linked to other databases or other
sources of information. We use the
term auxiliary information to capture
information about the respondents
other than that which is obtained
through the (interactive or noninteractive) statistical database. Any priors,
beliefs, or information from newspapers, labor statistics, and so on, all fall
into this category.
In a notable demonstration of
the power of auxiliary information,
medical records of the governor of
Massachusetts were identified by
linking voter registration records to
“anonymized” Massachusetts Group
Insurance Commission (GIC) medical encounter data, which retained
the birthdate, sex, and zip code of the
patient.22
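The mechanics of such a linkage attack are straightforward; the toy sketch below (entirely invented records, not the GIC or voter-registration data) joins two releases on the shared attributes birthdate, sex, and ZIP code.

# Toy illustration of a linkage attack: join an "anonymized" medical release
# with a public voter roll on the quasi-identifiers (birthdate, sex, ZIP).
# All records here are invented for illustration.
medical = [
    {"birthdate": "1950-04-12", "sex": "M", "zip": "02138", "diagnosis": "X"},
    {"birthdate": "1972-01-15", "sex": "F", "zip": "02139", "diagnosis": "Y"},
]
voters = [
    {"name": "Pat Doe",   "birthdate": "1950-04-12", "sex": "M", "zip": "02138"},
    {"name": "Sam Roe",   "birthdate": "1980-03-02", "sex": "F", "zip": "02139"},
]

def key(record):
    return (record["birthdate"], record["sex"], record["zip"])

voters_by_key = {key(v): v for v in voters}

# Any medical record whose quasi-identifiers match a voter record is re-identified.
for m in medical:
    match = voters_by_key.get(key(m))
    if match is not None:
        print(f'{match["name"]} -> diagnosis {m["diagnosis"]}')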
Despite this exemplary work, it has
taken several years to fully appreciate the importance of taking auxiliary
information into account in privacy-preserving data release. Sources and
uses of auxiliary information are endlessly varied. As a final example, it has
been proposed to modify search query
logs by mapping all terms, not just the
user ids, to random strings. In
token-based hashing, each query is tokenized,
and then an uninvertible hash function
is applied to each token. The intuition
is that the hashes completely obscure
the terms in the query. However, using
a statistical analysis of the hashed
log and any (unhashed) query log (for example, the released AOL log discussed above), the anonymization can be severely compromised, showing that token-based hashing is unsuitable for anonymization.17
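The following sketch (hypothetical code; the cited analysis relies on far richer co-occurrence statistics) shows both token-based hashing and the crudest form of such an attack: matching token frequencies in the hashed log against token frequencies in a public, unhashed log.

# Sketch: token-based hashing of a query log, and a crude frequency-matching
# attack using a public unhashed log as auxiliary information. The published
# attack uses much richer co-occurrence statistics; all logs here are invented.
import hashlib
from collections import Counter

def hash_token(token):
    return hashlib.sha256(token.encode()).hexdigest()[:8]

# "Private" log released with every token replaced by its hash.
private_log = ["cheap flights boston", "boston red sox", "cheap hotels boston"]
hashed_log = [" ".join(hash_token(t) for t in q.split()) for q in private_log]

# Public unhashed log (e.g., a previously leaked log) as auxiliary information.
public_log = ["boston weather", "cheap flights nyc", "boston marathon", "cheap eats"]

hashed_freq = Counter(t for q in hashed_log for t in q.split())
public_freq = Counter(t for q in public_log for t in q.split())

# Guess: the most frequent hashed token corresponds to the most frequent
# public token, the second to the second, and so on.
guesses = dict(zip([t for t, _ in hashed_freq.most_common()],
                   [t for t, _ in public_freq.most_common()]))
for hashed, guess in guesses.items():
    print(hashed, "->", guess)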