to the holdout set and gives a valid estimate of classifier
accuracy.
High correlation between labels and some of the variables. In our second experiment, the class labels are correlated with some of the variables. As before, the label is randomly chosen from {−1, 1} and each attribute is drawn from N(0, 1), except for 20 attributes, which are drawn from N(y · 0.06, 1), where y is the class label. We execute the same algorithm on this data with both the standard holdout and Thresholdout and plot the results in Figure 3. Our experiment shows that, when using the reusable holdout, the algorithm still finds a good classifier while avoiding overfitting. The reusable holdout thus simultaneously prevents overfitting and allows for the discovery of true statistical patterns.
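To make the setup concrete, the following is a minimal Python sketch of the data generation and of a simplified Thresholdout answering mechanism. The sample size n, dimension d, threshold T, and noise scale sigma below are illustrative placeholders rather than the values used in our experiment, and the variable-selection rule is a simplified stand-in for the analyst's procedure; the full Thresholdout mechanism additionally refreshes the noisy threshold and tracks a budget of detected overfitting events.

import numpy as np

rng = np.random.default_rng(0)

def make_data(n, d, k=20, bias=0.06):
    # Labels drawn from {-1, 1}; all attributes are N(0, 1) except the
    # first k, which are drawn from N(y * bias, 1).
    y = rng.choice([-1.0, 1.0], size=n)
    X = rng.normal(size=(n, d))
    X[:, :k] += y[:, None] * bias
    return X, y

def thresholdout(train_val, holdout_val, T=0.04, sigma=0.01):
    # Simplified Thresholdout: return the training estimate unless it
    # deviates from the holdout estimate by more than a noisy threshold,
    # in which case return a noisy holdout estimate.
    if abs(train_val - holdout_val) > T + rng.laplace(scale=sigma):
        return holdout_val + rng.laplace(scale=sigma)
    return train_val

def standard(train_val, holdout_val):
    # Standard holdout: report the exact holdout estimate.
    return holdout_val

def run(answer, n=1000, d=500):
    X_tr, y_tr = make_data(n, d)
    X_ho, y_ho = make_data(n, d)
    X_fr, y_fr = make_data(n, d)  # fresh data approximates true accuracy
    c_tr = X_tr.T @ y_tr / n      # per-variable correlation with the label
    c_ho = np.array([answer(c_tr[j], (X_ho[:, j] * y_ho).mean())
                     for j in range(d)])
    # Keep variables whose correlation is large, with matching sign on the
    # training set and in the (reported) holdout answers.
    cut = 1.0 / np.sqrt(n)
    sel = np.flatnonzero((np.abs(c_tr) > cut) & (np.abs(c_ho) > cut)
                         & (np.sign(c_tr) == np.sign(c_ho)))
    w = np.sign(c_tr[sel])        # sign-weighted linear classifier
    acc = lambda X, y: float(np.mean(np.sign(X[:, sel] @ w) == y))
    a_tr = acc(X_tr, y_tr)
    print(f"{sel.size:4d} vars | train {a_tr:.3f} | "
          f"holdout {answer(a_tr, acc(X_ho, y_ho)):.3f} | "
          f"fresh {acc(X_fr, y_fr):.3f}")

run(standard)
run(thresholdout)

On typical runs, the accuracy reported by the standard holdout drifts above the fresh-data accuracy as spuriously correlated variables are selected, while the Thresholdout answers stay close to it, mirroring the qualitative behavior in Figure 3.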
Acknowledgments
We would like to thank S. Arora, M.F. Balcan, A. Blum, D. Foster, M. Kearns, J. Kleinberg, A. Rakhlin, P. Rigollet, W. Su, and J. Ullman for enlightening discussions about this work. We also thank the Simons Institute for the Theory of Computing at Berkeley, where part of this research was done. This work was supported by the Alfred P. Sloan Foundation and NSF grant CNS 1253345.

Cynthia Dwork (dwork@microsoft.com), Microsoft Research, Mountain View, CA.
Vitaly Feldman (vitaly@post.harvard.edu), IBM Almaden Research Center, CA.
Moritz Hardt (m@mrtz.org), Google Research, Mountain View, CA.
Toniann Pitassi (toni@cs.toronto.edu), Department of Computer Science, University of Toronto, Toronto, Ontario, Canada.
Omer Reingold (reingold@stanford.edu), Computer Science Department, Stanford University, Stanford, CA.
Aaron Roth (aaroth@cis.upenn.edu), Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA.
Figure 3. Some variables are correlated with the label. Two panels, "Standard holdout" and "Thresholdout", plot accuracy (0.50 to 0.70) against the number of variables (0 to 500) for the training, holdout, and fresh sets.