review articles
DOI: 10.1145/1866739.1866758
What does it mean to preserve privacy?
BY CYNTHIA DWORK

A Firm Foundation for Private Data Analysis
In the information realm, loss of privacy is usually associated with failure to control access to information, to control the flow of information, or to control the purposes for which information is employed. Differential privacy arose in a context in which ensuring privacy is a challenge even if all these control problems are solved: privacy-preserving statistical analysis of data.

The problem of statistical disclosure control—revealing accurate statistics about a set of respondents while preserving the privacy of individuals—has a venerable history, with an extensive literature spanning statistics, theoretical computer science, security, databases, and cryptography (see, for example, the excellent survey of Adam and Wortmann,1 the discussion of related work in Blum et al.,2 and the issue of the Journal of Official Statistics dedicated to confidentiality and disclosure control).

This long history is a testament to the importance of the problem. Statistical databases can be of enormous social value; they are used for apportioning resources, evaluating medical therapies, understanding the spread of disease, improving economic utility, and informing us about ourselves as a species.

The data may be obtained in diverse ways. Some data, such as census, tax, and other sorts of official data, is compelled; other data is collected opportunistically, for example, from traffic on the Internet, transactions on Amazon, and search engine query logs; other data is provided altruistically, by respondents who hope that sharing their information will help others to avoid a specific misfortune or, more generally, to increase the public good. Altruistic data donors are typically promised their individual data will be kept confidential—in short, they are promised "privacy." Similarly, medical data and legally compelled data, such as census data and tax return data, have legal privacy mandates.
Key Insights

• In analyzing private data, only by focusing on rigorous privacy guarantees can we convert the cycle of "propose-break-propose again" into a path of progress.

• A natural approach to defining privacy is to require that accessing the database teaches the analyst nothing about any individual. But this is problematic: the whole point of a statistical database is to teach general truths, for example, that smoking causes cancer. Learning this fact teaches the data analyst something about the likelihood with which certain individuals, not necessarily in the database, will develop cancer. We therefore need a definition that separates the utility of the database (learning that smoking causes cancer) from the increased risk of harm due to joining the database. This is the intuition behind differential privacy, formalized below.

• This can be achieved, often with low distortion. The key idea is to randomize responses so as to effectively hide the presence or absence of the data of any individual over the course of the lifetime of the database; the sketch after this box illustrates one such randomized mechanism.
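To make the second insight concrete, here is the standard formulation from the differential privacy literature (the symbols ε, M, D, and D′ are the conventional ones; this excerpt does not itself fix the notation). A randomized mechanism M is ε-differentially private if, for every pair of databases D and D′ differing in the data of a single individual, and for every set S of possible outputs,

\[
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[\mathcal{M}(D') \in S].
\]

Small ε means the mechanism behaves nearly identically whether or not any one person's data is included, which is precisely how the definition bounds the increased risk of harm due to joining the database.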
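For the third insight, here is a minimal Python sketch of one standard way to realize such randomized responses: the Laplace mechanism applied to a counting query. Everything in it (the function names, the toy records, the choice of ε) is illustrative, not taken from the article.

```python
import random

def laplace_sample(scale: float) -> float:
    """Sample from Laplace(0, scale): the difference of two
    independent Exponential(1) draws is Laplace(0, 1)."""
    return scale * (random.expovariate(1.0) - random.expovariate(1.0))

def private_count(records, predicate, epsilon: float) -> float:
    """Answer "how many records satisfy predicate?" with epsilon-DP.

    Adding or removing one individual changes a count by at most 1
    (sensitivity 1), so Laplace noise of scale 1/epsilon makes the
    response distribution nearly the same with or without any one
    person, hiding their presence or absence as the insight describes.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_sample(1.0 / epsilon)

# Illustrative usage on a hypothetical table of (age, smoker) records.
if __name__ == "__main__":
    db = [(34, True), (61, False), (45, True), (29, False)]
    print(private_count(db, lambda r: r[1], epsilon=0.5))
```

Note the design point this sketch makes: the noise scale depends only on the query's sensitivity and on ε, not on the database size, so for large databases the distortion of a count is comparatively low.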