By Johannes Gehrke
Government agencies worldwide are
required to release statistical information about population, education,
health, crime, and economic activity.
In the U.S., protecting this data goes
back to the 19th century, when Carroll
Wright, the first head of the Bureau
of Labor Statistics, which was established in 1885, argued that protecting
the confidentiality of the Bureau’s data
was necessary. If enterprises feared
that data collected about them
by the Bureau would be shared with
competitors, investigators, or the tax
authorities, data quality would suffer
severely. The field of statistical disclosure
limitation was born.4
Fast-forward a few decades, and Stanley
Warner faced a similar conundrum. During interviews for market surveys, individuals would refuse to answer
questions on sensitive or controversial
issues “for reasons of modesty, fear of
being thought bigoted, or merely a reluctance to confide secrets to strangers.”7 His answer was a technique
in which the interviewee flips a
biased coin without showing the outcome to the interviewer. Depending
on the outcome of the coin flip, the
interviewee either (truthfully) answers
the original yes/no question or
negates her answer. This
method intuitively protects the interviewee, since her answer could always
have been due to the coin landing on
the other side.
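Warner’s scheme can be sketched in a few lines of Python. The truthful-answer probability p = 0.75 below is an illustrative choice, not a value from the article; the function names are likewise hypothetical.

```python
import random

def randomized_response(truth: bool, p: float = 0.75) -> bool:
    """Answer a yes/no question with plausible deniability.

    With probability p the respondent answers truthfully;
    otherwise she reports the negation of her true answer.
    (p = 0.75 is an illustrative bias, not from the article.)
    """
    return truth if random.random() < p else not truth

def estimate_true_fraction(answers, p: float = 0.75) -> float:
    """Unbiased estimate of the true fraction of 'yes' respondents.

    If pi is the true fraction, the expected fraction of observed
    'yes' responses is p*pi + (1 - p)*(1 - pi); solving for pi
    inverts the perturbation introduced by the coin flips.
    """
    observed = sum(answers) / len(answers)
    return (observed - (1 - p)) / (2 * p - 1)
```

Aggregated over many respondents, the estimator recovers the population fraction accurately, while any single answer could always be blamed on the coin.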
Tore Dalenius formulated a very
strong notion of protection a decade
later:2 “If the release of the statistic S
makes it possible to determine the
(microdata) value more accurately than
without access to S, a disclosure has
taken place…” This very strong notion
of semantic security implies that data
publishers should reason about adversaries and their knowledge, since the
published data could give new information to an adversary.
Fast-forward a few more decades
to the turn of the century. Statisticians
have developed many different
methods for limiting disclosure when
publishing data, such as suppression,
sampling, swapping, generalization
(also called coarsening), synthetic
data generation, data perturbation,
and the publishing of marginals for
contingency tables, just to name a
few. These methods are applied in
practice, but they do not provide formal
privacy guarantees: the methods
do not formally state how much an
attacker can learn, and they preserve
confidentiality by hiding the parameters of the method applied.
1. Agrawal, R. and Srikant, R. Privacy-preserving data
mining. In Proceedings of ACM SIGMOD (May 2000).
ACM Press, NY.
2. Dalenius, T. Towards a methodology for statistical
disclosure control. Statistik Tidskrift 15 (1977), 429-444.
3. Dwork, C., McSherry, F., Nissim, K. and Smith, A.
Calibrating noise to sensitivity in private data analysis.
In Proceedings of the 2006 TCC Conference (Mar.
2006), Springer-Verlag, 265-284.
4. Goldberg, J.P. and Moye, W.T. The First Hundred Years
of the Bureau of Labor Statistics. Bureau of Labor
Statistics.
5. Lindell, Y. and Pinkas, B. Privacy preserving data
mining. In Proceedings of Crypto ’00 (Aug. 2000).
6. U.S. Census Bureau’s Longitudinal Employer-Household Dynamics Program, OnTheMap Application.
7. Warner, S. Randomized response: A survey technique
for eliminating evasive answer bias. Journal of the
American Statistical Association (1965), 63-69.
Johannes Gehrke (email@example.com) is a
professor in the Department of Computer Science at
Cornell University, Ithaca, NY.