be the same for all users. It typically
includes ZIP code, birth date, gender,
and/or other demographics. The rest
of the attributes are assumed to be
non-identifying. De-identification involves modifying the quasi-identifiers
to satisfy various syntactic properties,
such as “every combination of quasi-identifier values occurring in the dataset must occur at least k times” [6]. The
trouble is that even though joining two
datasets on common attributes can
lead to re-identification, anonymizing
a predefined subset of attributes is not
sufficient to prevent it.
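The syntactic property quoted above, and its dependence on which attributes are designated as quasi-identifiers, can be illustrated with a minimal Python sketch. The records, the attribute names, and the helper is_k_anonymous are hypothetical and are not taken from any of the cited work.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs at least k times."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(c >= k for c in counts.values())

# Hypothetical, already-generalized records (illustration only).
records = [
    {"zip": "787**", "birth_year": 1980, "gender": "F", "diagnosis": "flu"},
    {"zip": "787**", "birth_year": 1980, "gender": "F", "diagnosis": "asthma"},
    {"zip": "787**", "birth_year": 1975, "gender": "M", "diagnosis": "flu"},
    {"zip": "787**", "birth_year": 1975, "gender": "M", "diagnosis": "diabetes"},
]
qi = ["zip", "birth_year", "gender"]

print(is_k_anonymous(records, qi, k=2))                  # True
print(is_k_anonymous(records, qi + ["diagnosis"], k=2))  # False
```

The same table passes or fails the check depending only on which columns are declared quasi-identifiers. An adversary who happens to know a value in a column assumed to be non-identifying is not bound by that declaration.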
Re-identification without PII
Any information that distinguishes
one person from another can be used
for re-identifying anonymous data.
Examples include the AOL fiasco, in
which the content of search queries
was used to re-identify a user; our own
work, which demonstrated feasibility
of large-scale re-identification using
movie viewing histories (or, in general,
any behavioral or transactional profile [2]) and the local structure of social networks [3]; and re-identification based on
location information and stylometry
(for example, the latter was used to infer the authorship of the 12 disputed
Federalist Papers).
Re-identification algorithms are agnostic to the semantics of the data elements. It turns out there is a wide spectrum of human characteristics that
enable re-identification: consumption
preferences, commercial transactions, Web browsing, search histories,
and so forth. Their two key properties
are that (1) they are reasonably stable
across time and contexts, and (2) the
corresponding data attributes are sufficiently numerous and fine-grained
that no two people are similar, except
with a small probability.
The versatility and power of re-identification algorithms imply that terms
such as “personally identifiable” and
“quasi-identifier” simply have no technical meaning. While some attributes
may be uniquely identifying on their
own, any attribute can be identifying in
combination with others. Consider, for
example, the books a person has read
or even the clothes in her wardrobe:
while no single element is a (quasi-)identifier, any sufficiently large subset
uniquely identifies the individual.
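A rough way to see why a large enough subset of individually innocuous attributes becomes identifying is to count how many profiles remain shared as more attributes are observed. The sketch below assumes a hypothetical population of 1,000 people with 30 independent yes/no attributes (for example, "has read book i"). Real attributes are correlated and unevenly distributed, so the numbers are illustrative only.

```python
import random
from collections import Counter

def fraction_unique(profiles, num_attributes):
    """Fraction of people whose first num_attributes values match nobody else's."""
    prefixes = [tuple(p[:num_attributes]) for p in profiles]
    counts = Counter(prefixes)
    return sum(1 for p in prefixes if counts[p] == 1) / len(profiles)

rng = random.Random(0)  # fixed seed so that runs are repeatable
# Hypothetical population: each attribute is an independent coin flip.
profiles = [[rng.random() < 0.5 for _ in range(30)] for _ in range(1000)]

for m in (1, 5, 10, 15, 20, 30):
    print(f"{m:2d} attributes observed: {fraction_unique(profiles, m):.3f} unique")
```

Observing an additional attribute can only split groups of matching people further, so the unique fraction never decreases as more attributes are added.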
Re-identification algorithms based
on behavioral attributes must tolerate a certain “fuzziness” or imprecision in attribute values. They are thus
more computationally expensive and
more difficult to implement than re-identification based on demographic
quasi-identifiers. This is not a significant deterrent, however, because re-identification is a one-time effort and its cost can be amortized over
thousands or even millions of individuals. Further, as Paul Ohm argues, re-identification is “accretive”: the more
information about a person is revealed
as a consequence of re-identification,
the easier it is to identify that person in
the future [4].
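The tolerance for imprecise attribute values described above can be sketched as a scoring function that compares an adversary's approximate knowledge against every record. This is a simplified illustration, not the algorithm published in the cited papers. The data, the names score and best_match, the tolerance parameter, and the 1/log(1 + support) weighting are all assumptions made for the example.

```python
import math

def score(aux, record, support, tolerance=1):
    """Sum rarity weights over auxiliary items whose value in record is within tolerance."""
    total = 0.0
    for item, value in aux.items():
        if item in record and abs(record[item] - value) <= tolerance:
            # Assumed weighting: items held by few records count for more.
            total += 1.0 / math.log(1 + support[item])
    return total

def best_match(aux, dataset, tolerance=1):
    """Return the id of the highest-scoring record and the scores of all records."""
    support = {}
    for record in dataset.values():
        for item in record:
            support[item] = support.get(item, 0) + 1
    scores = {rid: score(aux, rec, support, tolerance) for rid, rec in dataset.items()}
    return max(scores, key=scores.get), scores

# Hypothetical "anonymized" ratings and an adversary's imprecise side knowledge.
dataset = {
    "user_a": {"film_1": 5, "film_2": 3, "film_3": 4},
    "user_b": {"film_1": 4, "film_4": 2, "film_5": 5},
    "user_c": {"film_1": 5, "film_2": 1, "film_6": 3},
}
aux = {"film_1": 4, "film_4": 3, "film_5": 5}

match, scores = best_match(aux, dataset)
print(match)  # user_b
```

Every record must be scored against the auxiliary information, which is where the extra computational cost comes from. The scoring function itself is written once and then applied to as many targets as the adversary likes.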
Lessons for Privacy Practitioners
The emergence of powerful re-identification algorithms demonstrates not
just a flaw in specific anonymization
techniques, but the fundamental
inadequacy of the entire privacy protection paradigm based on
“de-identifying” the data. De-identification provides only a weak form of privacy. It may
prevent “peeping” by insiders and keep
honest people honest. Unfortunately,
advances in the art and science of re-identification, increasing economic
incentives for potential attackers, and
ready availability of personal information about millions of people (for example, in online social networks) are
rapidly rendering it obsolete.
The PII fallacy has important implications for health-care and biomedical
datasets. The “safe harbor” provision
of the HIPAA Privacy Rule enumerates
18 attributes whose removal and/or
modification is sufficient for the data
to be considered properly de-identified, with the implication that such
data can be released without liability.
This appears to contradict our argument that PII is meaningless. The “safe
harbor” provision, however, applies
only if the releasing entity has “no actual knowledge that the information
remaining could be used, alone or in
combination, to identify a subject of
the information.” As actual experience
has shown, any remaining attributes
can be used for re-identification, as
long as they differ from individual to
individual. Therefore, PII has no meaning even in the context of the HIPAA
Privacy Rule.
Beyond De-identification
Developing effective privacy protection
technologies is a critical challenge for
security and privacy research. While
much work remains to be done, some
broad trends are becoming clear, as
long as we avoid the temptation to find
a silver bullet. Differential privacy is a
major step in the right direction [1]. Instead of pursuing the unattainable goal of
“de-identifying” the data, it formally defines what it means for a computation
to be privacy-preserving. Crucially, it
makes no assumptions about the external information available to the adversary. Differential privacy, however,
does not offer a universal methodology
for data release or collaborative, privacy-preserving computation. This limitation is inevitable: privacy protection
has to be built and reasoned about on
a case-by-case basis.
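For reference, the formal definition mentioned above is commonly stated as follows. The symbols are notation introduced here rather than in this article: M is a randomized computation, D and D' are any two datasets that differ in one individual's record, S is any set of possible outputs, and ε is the privacy parameter.

```latex
\Pr[\, M(D) \in S \,] \;\le\; e^{\varepsilon} \cdot \Pr[\, M(D') \in S \,]
\qquad \text{for all such } D, D' \text{ and all } S .
```

The condition constrains only the computation M and says nothing about what the adversary already knows, which is why no assumption about external information is needed.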
Another lesson is that an interactive, query-based approach is generally
superior from the privacy perspective
to the “release-and-forget” approach.
This can be a hard pill to swallow, because the former requires designing
a programming interface for queries,
budgeting for server resources, performing regular audits, and so forth.
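What such a query interface might look like can be sketched in a few lines of Python. The sketch assumes counting queries answered with Laplace noise and a fixed total privacy budget, both standard ingredients of differential privacy. The class name BudgetedCountServer, its methods, and the data are hypothetical, and a production system would need considerably more care (for example, in how the noise is sampled).

```python
import random

class BudgetedCountServer:
    """Answers noisy counting queries until a total epsilon budget is spent."""

    def __init__(self, records, total_epsilon, seed=None):
        self._records = list(records)
        self._remaining = total_epsilon
        self._rng = random.Random(seed)
        self.audit_log = []  # (epsilon requested, granted?) for later review

    def count(self, predicate, epsilon):
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self._remaining:
            self.audit_log.append((epsilon, False))
            raise PermissionError("privacy budget exhausted")
        self._remaining -= epsilon
        self.audit_log.append((epsilon, True))
        true_count = sum(1 for r in self._records if predicate(r))
        # The difference of two i.i.d. exponentials with rate epsilon is
        # Laplace noise with scale 1/epsilon; a count has sensitivity 1.
        noise = self._rng.expovariate(epsilon) - self._rng.expovariate(epsilon)
        return true_count + noise

# Hypothetical usage: the raw records never leave the server.
server = BudgetedCountServer(
    [{"age": a} for a in (23, 35, 41, 58, 62)], total_epsilon=1.0, seed=0
)
noisy = server.count(lambda r: r["age"] >= 40, epsilon=0.5)
print(round(noisy, 2))
```

Because the data holder sees every query, it can refuse requests once the budget is spent and keep a log for audits, neither of which is possible after a one-time release.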
Finally, any system for privacy-preserving computation on sensitive data
must be accompanied by strong access
control mechanisms and non-technological protection methods such as informed consent and contracts specifying acceptable uses of data.
References
1. Dwork, C. A firm foundation for private data analysis.
Commun. ACM (to appear).
2. Narayanan, A. and Shmatikov, V. Robust de-anonymization of large sparse datasets. In
Proceedings of the 2008 IEEE Symposium on Security
and Privacy.
3. Narayanan, A. and Shmatikov, V. De-anonymizing
social networks. In Proceedings of the 2009 IEEE
Symposium on Security and Privacy.
4. Ohm, P. Broken promises of privacy: Responding to
the surprising failure of anonymization. UCLA Law
Review 57, 2010 (to appear).
5. Sweeney, L. Weaving technology and policy together
to maintain confidentiality. J. of Law, Medicine, and
Ethics 25 (1997).
6. Sweeney, L. Achieving k-anonymity privacy protection
using generalization and suppression. International
Journal on Uncertainty, Fuzziness, and Knowledge-Based Systems 10 (2002).
Arvind Narayanan (arvindn@cs.utexas.edu) is a
postdoctoral fellow at Stanford University. Vitaly
Shmatikov (shmat@cs.utexas.edu) is an associate
professor of computer science at the University of Texas
at Austin. Their paper on de-anonymization of large sparse
datasets [2] received the 2008 PET Award for Outstanding
Research in Privacy Enhancing Technologies.