lished approximately 7. 7 billion linearly
independent statistics, including 2. 7
billion in the PL94-171 redistricting file,
2. 8 billion in the balance of summary
file 1, 2 billion in summary file 2, and 31
million records in a public-use microdata sample. This results in approximately 25 statistics per person. Given
these numbers and the example in this
article, it is clear that there is a theoretical possibility the national-level census
could be reconstructed, although tools
such as Sugar and PicoSAT are probably
not powerful enough to do so.
To protect the privacy of census respondents, the Census Bureau is developing a
privacy-protection system based on differential privacy. This system will ensure
every statistic and the corresponding microdata receive some amount of privacy
protection, while providing that the resulting statistics are sufficiently accurate
for their intended purpose.
This article has explained the motivation for the decision to use differential
privacy. Without a privacy-protection system based on noise injection, it would be
possible to reconstruct accurate microdata using only the published statistics.
By using differential privacy, we can add
the minimum amount of noise necessary to achieve the Census Bureau’s privacy requirements. A future article will
explain how that system works.
In 2003, Irit Dinur and Kobbi Nissim4
showed the amount of noise that must
be added to a database to prevent a reconstruction of the underlying data
is on the order of Ω(√n) where n is the
number of bits in the database. In practice, many statistical agencies do not
add this much noise when they release
statistical tables. (In our example, each
record contains 11 bits of data, so the
confidential database has 77 bits of
information. Each statistic in Table 3
can be modeled as a four-bit of count, a
seven-bit of median, and a seven-bit of
mean, for a total of 18 bits; Table 3 releases 126 bits of information.) Dinur
and Nissim’s primary finding is that
many statistical agencies leave themselves open to the risk of database
reconstruction. This article demonstrates one way to conduct that attack.
Statistical tables create the possibility
of database reconstruction because they
form a set of constraints for which there
is ultimately only one exact solution
when the published table is correctly tab-
ulated from a real confidential database.
Restricting the number or specific types
of queries—for example, by suppressing
results from a small number of respon-
dents—is often insufficient to prevent
access to indirectly identifying informa-
tion, because the system’s refusal to an-
swer a “dangerous” query itself provides
the attacker with information.
With the dramatic improvement in
both computer speeds and the efficiency of SAT and other NP-hard solvers in
the last decade, DRAs on statistical databases are no longer just a theoretical
danger. The vast quantity of data products published by statistical agencies
each year may give a determined attacker more than enough information
to reconstruct some or all of a target
database and breach the privacy of millions of people. Traditional disclosure-avoidance techniques are not designed
to protect against this kind of attack.
Faced with the threat of database reconstruction, statistical agencies have
two choices: they can either publish
dramatically less information or use
some kind of noise injection. Agencies
can use differential privacy to determine the minimum amount of noise
necessary to add, and the most efficient way to add that noise, in order to
achieve their privacy protection goals.
Robert Ashmead, Chris Clifton, Kobbi
Nissim, and Philip Leclerc provided extraordinarily useful comments on this
article. Naoyuki Tamura provided invaluable help regarding the use of Sugar.
Go Static or Go Home
Privacy, Anonymity, and Big Data in the
Jon P. Daries et al.
Research for Practice: Private Online
Communication; Highlights in Systems
Albert Kwon, James Wilcox
1. Adam, N.R., Worthmann, J.C. Security-control
methods for statistical databases: A comparative
study. ACM Computing Surveys 21, 4 (1989), 515–556;
2. Biere, A. PicoSAT essentials. J. Satisfiability, Boolean
Modeling and Computation 4, (2008), 75–97; https://
3. Blum, A., Dwork, C., McSherry, F. Nissim, K. Practical
privacy: The SuLQ framework. In Proceedings of the
24th ACM SIGMOD-SIGACT-SIGART Symposium
on Principles of Database Systems, 2005, 128–138;
4. Dinur, I., Nissim, K. Revealing information while
preserving privacy. In Proceedings of the 22nd ACM
SIGMOD-SIGACT-SIGART Principles of Database
Systems, 2003, 202–210; https://dl.acm.org/citation.
5. Dwork, C., McSherry, F., Nissim, K., Smith, A. 2006.
Calibrating noise to sensitivity in private data
analysis. In Proceedings of the 3rd Conference
on Theory of Cryptography, 2006, 265–284.
Springer-Verlag, Berlin, Heidelberg; http://dx.doi.
6. Dwork, C., Nissim, K. Privacy-preserving datamining
on vertically partitioned databases. In Proceedings
of the 24th International Cryptology Conference
3152, 2004, 528–544. Springer Verlag, Santa
Barbara, CA; https://bit.ly/2zKunmJ
7. Heule, M.J.H., Kullmann, O. The science of brute force.
Commun. ACM 60, 8 (Aug. 2017), 70–79; http://doi.
8. Kleinberg, J., Papadimitriou, C., Raghavan, P. Auditing
Boolean attributes. In Proceedings of the 19th ACM
SIGMOD-SIGACT-SIGART Symposium on Principles
of Database Systems, 2000, 86–91; http://doi.acm.
9. Kong, S., Malec, D. Cook-Levin theorem. Lecture,
University of Wisconsin, 2007.
10. Marques-Silva, J., Lynce, I., Malik, S. Conflict-driven clause learning SAT solvers. Handbook of
Satisfiability, 131–153. IOS Press, Amsterdam,
The Netherlands, 2009.
11. McCarthy, J. Recursive functions of symbolic
expressions and their computation by machine, part I.
Commun. ACM 3, 4 (Apr. 1960), 184–195;
12. Metin, H., Baarir, S., Colange, M., Kordon, F.
CDCLSym: Introducing effective symmetry
breaking in SAT solving. Tools and Algorithms
for the Construction and Analysis of Systems. D.
Beyer and M. Huisman, eds. Springer International
Publishing, 2018, 99–114; https://link.springer.com/
13. Tamura, N., Taga, A., Kitagawa, S., Banbara, M.
Compiling finite linear CSP into SAT. Contraints
14, 2 (2009), 254–272; https://dl.acm.org/citation.
14. Whitney, C.R. Jeanne Calment, world’s elder,
dies at 122. New York Times (Aug. 5, 1997);
15. Zayatz, L., Lucero, J., Massell, P., Ramanayake, A.
Disclosure avoidance for Census 2010 and American
Community Survey five-year tabular data products.
Statistical Research Division, U.S. Census Bureau;
Simson L. Garfinkel is the Senior Computer Scientist
for Confidentiality and Data Access at the U.S. Census
Bureau and the Chair of the Census Bureau’s Disclosure
John M. Abowd is the Chief Scientist and Associate
Director for Research and Methodology at the U.S. Census
Bureau, where he serves on leave from his position as the
Edmund Ezra Day Professor of Economics, professor of
information science, and member of the Department of
Statistical Sciences at Cornell University, Ithaca, NY, USA.
Christian Martindale is a senior at Duke University,
Durham, NC, USA.
Copyright held by authors/owners.
Publication rights licensed to ACM.