Of course, this is an extremely simple example. Statistical agencies have understood the risk of such unintended disclosure for decades and have developed a variety of techniques to protect data confidentiality while still publishing useful statistics. These techniques include cell suppression, which prohibits publishing statistical summaries from small groups of respondents; top-coding, in which ages higher than a certain limit are coded as that limit before statistics are computed; noise injection, in which random values are added to some attributes; and swapping, in which some of the attributes of records representing different individuals or families are swapped. Together, these techniques are called statistical disclosure limitation (SDL).
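To give a concrete, if simplified, picture of these operations, the following Python sketch applies top-coding, noise injection, and cell suppression to a made-up table of records; the column names, the age limit of 85, the noise scale, and the minimum cell size of three are illustrative assumptions, not any agency's actual rules.

    # Illustrative statistical disclosure limitation (SDL) steps on a toy
    # survey table. Column names and thresholds are made up for the example.
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(42)

    records = pd.DataFrame({
        "block":  ["A", "A", "A", "B", "B", "C"],
        "age":    [22, 35, 91, 47, 104, 63],
        "income": [31_000, 52_000, 18_000, 75_000, 40_000, 66_000],
    })

    # Top-coding: ages above a limit are recoded to that limit before
    # any statistics are computed.
    TOP_CODE = 85
    records["age"] = records["age"].clip(upper=TOP_CODE)

    # Noise injection: random values are added to a sensitive attribute.
    records["income"] = records["income"] + rng.normal(0, 1_000, len(records))

    # Cell suppression: statistics computed from too few respondents
    # are withheld from publication.
    MIN_CELL = 3
    by_block = records.groupby("block").agg(n=("age", "size"),
                                            mean_age=("age", "mean"))
    by_block.loc[by_block["n"] < MIN_CELL, "mean_age"] = np.nan  # suppressed

    print(by_block)

Swapping, the fourth technique, would exchange attributes between pairs of records representing different individuals or families and is omitted here for brevity.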
Computer scientists started exploring the issue of statistical privacy in the 1970s with the increased availability of interactive query systems. The goal was to build a system that would allow users to make queries that would produce summary statistics without revealing information about individual records. Three approaches emerged: auditing database queries, so that users would be prevented from issuing queries that zeroed in on data from specific individuals; adding noise to the data stored within the database; and adding noise to query results.1 Of these three, the approaches of adding noise proved to be easier, because the complexity of auditing queries increased exponentially over time and, in fact, was eventually shown to be NP (nondeterministic polynomial)-hard.8 Although these results were all couched in the language of interactive query systems, they apply equally well to the activities of statistical agencies, with the database being the set of confidential survey responses, and the queries being the schedule of statistical tables that the agency intends to publish.
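As a rough sketch of the third approach (the other two are analogous), an interactive query system might perturb each answer before releasing it. The Laplace noise and the scale used below are assumptions chosen for illustration, not the mechanism of any particular historical system.

    # Toy interactive query interface that adds random noise to each
    # query result before returning it. The Laplace distribution and
    # scale are illustrative choices, not a specific historical design.
    import numpy as np

    rng = np.random.default_rng(0)
    ages = np.array([22, 35, 91, 47, 104, 63, 58, 29])  # confidential records

    def noisy_count(predicate, scale=1.0):
        """COUNT of records satisfying predicate, plus Laplace noise."""
        true_count = int(sum(predicate(a) for a in ages))
        return true_count + rng.laplace(0.0, scale)

    print(noisy_count(lambda age: age > 60))  # close to, but not exactly, 3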
In 2003, Irit Dinur and Kobbi Nissim showed that it isn’t even necessary for an attacker to construct queries on a database carefully to reveal its underlying confidential data.4 Even a surprisingly small number of random queries can reveal confidential data, because the results of the queries can be combined and then used to “reconstruct” the underlying confidential data. Adding noise to either the database or to the results of the queries decreases the accuracy of the reconstruction, but it also decreases the accuracy of the queries. The challenge is to add sufficient noise in such a way that each individual’s privacy is protected, but not so much noise that the utility of the database is ruined.
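The flavor of that result can be seen in a small simulation: treat the confidential data as a vector of secret bits, answer a batch of random subset-sum queries with a little noise added, and then solve the resulting system by least squares. The sizes and noise level below are arbitrary assumptions for illustration, not the parameters from Dinur and Nissim's analysis.

    # Toy reconstruction attack in the spirit of Dinur and Nissim:
    # noisy answers to random subset-sum queries are combined to
    # recover most of the underlying secret bits.
    import numpy as np

    rng = np.random.default_rng(1)
    n, m = 32, 200                               # 32 secret bits, 200 queries
    secret = rng.integers(0, 2, size=n)          # the confidential column

    A = rng.integers(0, 2, size=(m, n)).astype(float)  # each row: a random subset
    noise = rng.uniform(-1.0, 1.0, size=m)             # small perturbation per answer
    answers = A @ secret + noise                       # the published query results

    estimate, *_ = np.linalg.lstsq(A, answers, rcond=None)
    reconstructed = (estimate > 0.5).astype(int)

    print("bits recovered:", int((reconstructed == secret).sum()), "of", n)

Raising the noise level in this sketch lowers the recovery rate but also makes each individual answer less accurate, which is precisely the trade-off described above.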
Subsequent publications3,6 refined the idea of adding noise to published tables to protect the privacy of the individuals in the dataset. Then in 2006, Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith proposed a formal framework for understanding these results. Their paper, “Calibrating Noise to Sensitivity in Private Data Analysis,”5 introduced the concept of differ-