interleaved arbitrarily with other data
in the stream.
As an example, consider the problem of density estimation. Assuming,
for simplicity, that the data stream is
just a sequence of IP addresses in a
certain range, we wish to know what
fraction of the set of IP addresses in the
range actually appears in the stream.
A solution inspired by randomized
response can be designed using the following technique.10
Define two probability distributions, D0 and D1, on the set {0, 1}. D0
assigns equal mass to zero and to one.
D1 has a slight bias toward 1; specifically, 1 has mass 1/2 + ε/4, while 0 has
mass 1/2 − ε/4.
Let X denote the set of all possible IP
addresses in the range of interest. The
algorithm creates a table, with a 1-bit
entry b_x for each x ∈ X, initialized to an
independent random draw from D0. So
initially the table is roughly half zeroes
and half ones.
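To make the setup concrete, here is a minimal sketch in Python; the universe size, the value of ε, and all identifiers are illustrative assumptions rather than details from the text:

import random

EPS = 0.1  # privacy parameter epsilon; illustrative value

def draw_d0():
    # D0 assigns mass 1/2 to each of 0 and 1
    return 1 if random.random() < 0.5 else 0

def draw_d1(eps=EPS):
    # D1 assigns mass 1/2 + eps/4 to 1 and 1/2 - eps/4 to 0
    return 1 if random.random() < 0.5 + eps / 4 else 0

# Stand-in universe X of "IP addresses": the integers 0..999
X = range(1000)

# One 1-bit entry b_x per x in X, each an independent draw from D0,
# so the freshly initialized table is roughly half zeroes and half ones
table = {x: draw_d0() for x in X}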
In an atomic step, the algorithm
receives an element from the stream,
changes state, and discards the element. When processing x ∈ X, the
algorithm makes a fresh random
draw from D1, and stores the result in
b_x. This is done no matter how many
times x may have appeared in the
past. Thus, for any x appearing at least
once, b_x will be distributed according to D1. However, if x never appears,
then the entry for x is the bit drawn
according to D0 during the initialization of the table.
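In code, the atomic step is a single overwrite; a sketch, again with illustrative names and independent of the snippet above:

import random

def process(table, x, eps):
    # Atomic step: on seeing stream element x, overwrite b_x with a
    # fresh draw from D1, regardless of how many times x appeared
    # before; afterward b_x ~ D1 for any x seen at least once
    table[x] = 1 if random.random() < 0.5 + eps / 4 else 0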
As with randomized response,
the density in X of the items in the
stream can be approximated from
the number of 1's in the table, taking
into account the expected fraction of
"false positives" from the initialization phase and the "false negatives"
when sampling from D1. Letting θ
denote the fraction of entries in the
table with value 1, the output is
4(θ − 1/2)/ε + Lap(1/(ε|X|)).
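A sketch of the estimator matching this formula; the Laplace draw is implemented as the difference of two independent exponential draws, a standard identity, and the names remain illustrative:

import random

def estimate_density(table, eps):
    # theta: fraction of table entries equal to 1
    theta = sum(table.values()) / len(table)
    # Lap(1/(eps*|X|)): the difference of two independent draws
    # from Exp(eps*|X|) is Laplace-distributed with scale 1/(eps*|X|)
    rate = eps * len(table)
    noise = random.expovariate(rate) - random.expovariate(rate)
    # Debias the randomized-response-style table and add output noise
    return 4 * (theta - 0.5) / eps + noise

If a d fraction of the universe appears in the stream, the expected fraction of 1's is (1 − d)/2 + d(1/2 + ε/4) = 1/2 + dε/4, so 4(θ − 1/2)/ε is an unbiased estimate of d.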
Intuitively, the internal state is
differentially private because, for
each x ∈ X, the bit b_x is drawn from
either D0 or D1, and these two distributions assign each bit value probabilities
within a factor of roughly e^ε of one another;
privacy for the output is ensured by
the addition of Laplacian noise. Overall, the algorithm is 2ε-differentially
pan-private.
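The bounded-ratio calculation behind the state's privacy can be written out; this is a sketch of the standard argument, with the last inequality valid for ε ≤ 1:

\[
\frac{\Pr_{D_1}[b_x=1]}{\Pr_{D_0}[b_x=1]} = \frac{1/2+\varepsilon/4}{1/2} = 1+\frac{\varepsilon}{2} \le e^{\varepsilon/2},
\qquad
\frac{\Pr_{D_0}[b_x=0]}{\Pr_{D_1}[b_x=0]} = \frac{1/2}{1/2-\varepsilon/4} = \frac{1}{1-\varepsilon/2} \le e^{\varepsilon} \quad (\varepsilon \le 1),
\]

so an intrusion on the state changes the likelihood of what is observed by at most an e^ε factor, and the Laplace noise on the output accounts for the remaining ε in the 2ε bound.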
Conclusion
The differential privacy frontier is
expanding rapidly, and there is insuf-
ficient space here to list all the inter-
esting directions currently under
investigation by the community. We
identify a few of these.
The Geometry of Differential Privacy.
Sharper upper and lower bounds
on noise required for achieving differential privacy against a sequence
of linear queries can be obtained
by understanding the geometry of
the query sequence.14 In some cases
dependencies among the queries can
be exploited by the curator to markedly
improve the accuracy of the responses.
Generalizing this investigation to the
nonlinear and interactive cases would
be of significant interest.
Algorithmic Complexity. We have
so far ignored questions of computational complexity. Many, but not
all, of the techniques described here
have efficient implementations. For
example, there are instances of the
synthetic data generation problem
that, under standard cryptographic
assumptions, have no polynomial
time implementation.11 It follows
that there are cases in which the exponential mechanism has no efficient
implementation. When can this powerful tool be implemented efficiently,
and how?
An Alternative to Differential Privacy?
Is there an alternative, “ad omnia,”
guarantee that composes automatically, and permits even better accuracy
than differential privacy? Can cryptography be helpful in this regard?20
The work described herein has,
for the first time, placed private data
analysis on a strong mathematical
foundation. The literature connects
differential privacy to decision theory,
economics, robust statistics, geometry, additive combinatorics, cryptography, complexity theory, learning
theory, and machine learning.
Differential privacy thrives because
it is natural, it is not domain-specific,
and it enjoys fruitful interplay with
other fields. This flexibility gives hope
for a principled approach to privacy
in cases, like private data analysis,
where traditional notions of cryptographic security are inappropriate or
impracticable.
References
1. Adam, N.R., Wortmann, J. Security-control methods
for statistical databases: A comparative study. ACM
Comput. Surv. 21 (1989), 515–556.
2. Blum, A., Dwork, C., McSherry, F., Nissim, K.
Practical privacy: The SuLQ framework. In
Proceedings of the 24th ACM Symposium on
Principles of Database Systems (2005), 128–138.
3. Blum, A., Ligett, K., Roth, A. A learning theory
approach to non-interactive database privacy. In
Proceedings of the 40th ACM Symposium on Theory
of Computing (2008), 609–618.
4. Denning, D.E. Secure statistical databases with
random sample queries. ACM Trans. Database Syst.
5 (1980), 291–315.
5. Dinur, I., Nissim, K. Revealing information while
preserving privacy. In Proceedings of the 22nd ACM
Symposium on Principles of Database Systems
(2003), 202–210.
6. Dwork, C. Differential privacy. In Proceedings of
the 33rd International Colloquium on Automata,
Languages and Programming (ICALP) (2) (2006),
1–12.
7. Dwork, C., McSherry, F., Nissim, K., Smith, A.
Calibrating noise to sensitivity in private data
analysis. In Proceedings of the 3rd Theory of
Cryptography Conference (2006), 265–284.
8. Dwork, C., McSherry, F., Talwar, K. The price of
privacy and the limits of lp decoding. In Proceedings
of the 39th ACM Symposium on Theory of
Computing (2007), 85–94.
9. Dwork, C., Naor, M. On the difficulties of disclosure
prevention in statistical databases or the case for
differential privacy. J. Privacy Confidentiality 2
(2010). Available at: http://repository.cmu.edu/jpc/
vol2/iss1/8.
10. Dwork, C., Naor, M., Pitassi, T., Rothblum, G.,
Yekhanin, S. Pan-private streaming algorithms. In
Proceedings of the 1st Symposium on Innovations in
Computer Science (2010).
11. Dwork, C., Naor, M., Reingold, O., Rothblum, G.,
Vadhan, S. When and how can privacy-preserving
data release be done efficiently? In Proceedings of
the 41st ACM Symposium on Theory of Computing
(2009), 381–390.
12. Dwork, C., Nissim, K. Privacy-preserving datamining
on vertically partitioned databases. In Advances in
Cryptology—CRYPTO’04 (2004), 528–544.
13. Goldwasser, S., Micali, S. Probabilistic encryption.
JCSS 28 (1984), 270–299.
14. Hardt, M., Talwar, K. On the geometry of differential
privacy. In Proceedings of the 42nd ACM
Symposium on Theory of Computing (2010),
705–714.
15. Kenthapadi, K., Mishra, N., Nissim, K. Simulatable
auditing. In Proceedings of the 24th ACM
Symposium on Principles of Database Systems
(2005), 118–127.
16. Kleinberg, J., Papadimitriou, C., Raghavan, P.
Auditing boolean attributes. In Proceedings of the
19th ACM Symposium on Principles of Database
Systems (2000), 86–91.
17. Kumar, R., Novak, J., Pang, B., Tomkins, A. On
anonymizing query logs via token-based hashing. In
Proceedings of the WWW 2007 (2007), 629–638.
18. McSherry, F. Privacy integrated queries (codebase).
Available on Microsoft Research downloads website.
See also Proceedings of SIGMOD (2009), 19–30.
19. McSherry, F., Talwar, K. Mechanism design via
differential privacy. In Proceedings of the 48th
Annual Symposium on Foundations of Computer
Science (2007).
20. Mironov, I., Pandey, O., Reingold, O., Vadhan, S.
Computational differential privacy. In Advances in
Cryptology—CRYPTO’09 (2009), 126–142.
21. Rubin, D. Discussion: Statistical disclosure limitation.
J. Official Statist. 9 (1993), 462–468.
22. Sweeney, L. Weaving technology and policy together
to maintain confidentiality. J. Law Med. Ethics 25
(1997), 98–110.
23. Warner, S. Randomized response: a survey technique
for eliminating evasive answer bias. JASA (1965),
63–69.
Cynthia Dwork ( dwork@microsoft.com) is a principal
researcher at Microsoft Research, Silicon Valley Campus,
Mountain View, CA.