An important goal in machine learning is to obtain a predictive model that generalizes well, that is, a model whose
accuracy on the data is representative of its accuracy on
future data generated by the same process. Indeed, a large
body of theoretical and empirical research has been developed
for ensuring generalization in a variety of settings. In
theoretical work, it is commonly assumed that the learning algorithm operates on a freshly sampled dataset. In
practice, instead, a dataset is split randomly into two (or
sometimes more) parts: the training set and the testing, or holdout, set.a The training set is used for learning a predictor and the holdout set is used to estimate the true accuracy of the predictor. Because the predictor is independent
of the holdout dataset, such an estimate is a valid estimate
of the true prediction accuracy.
However, in practice the holdout dataset is rarely used
only once. One prominent example in which a holdout set is
often adaptively reused is hyperparameter tuning. Similarly,
the holdout set in a machine learning competition, such as
the famous ImageNet competition, is typically reused many
times adaptively. Other examples include using the holdout set for variable selection, generation of base learners
(in aggregation techniques, such as boosting and bagging),
checking a stopping condition, and analyst-in-the-loop decisions. Such reuse is known to lead to overfitting to the holdout set (e.g., Refs. 7, 22).
The literature recognizes the risks and proposes solutions in a number of special cases of adaptive data analysis. Most of them address a single round of adaptivity, such as variable selection followed by regression on the selected variables, or model selection followed by testing, and are optimized for specific inference procedures (see Chapter 7 in Ref. 17 for an overview). Yet, to our knowledge, there is
no prior work giving a general methodology for addressing the risks of adaptive data reuse over many rounds of
adaptivity and without restricting the type of procedures
that are performed. We describe such a methodology,
based on techniques from privacy-preserving data analysis, together with a concrete application we call the
reusable holdout.
2. OUR APPROACH AND RESULTS
Let us establish some simple terminology. We represent a
data point as an element of some universe X, and a dataset
consists of n data points. A data generating process gives
rise to a probability distribution over datasets. We will focus
on the most commonly studied setting in which each point
of the dataset is drawn randomly and independently from
some unknown distribution P over X. For example, the
dataset may contain the health information and habits of n
individuals, and the analyst is trying to learn about medical
conditions affecting the population from which the individuals were drawn randomly.
We view adaptive analysis as a process in which an analyst wishes to ask a sequence of queries on a given dataset.
Here a query could refer to an execution of some statistical
procedure, a learning algorithm, a preprocessing step, or any
other inspection of the data. Crucially, after asking the first
t queries, the analyst can use the results of those queries to
pick the query performed at step t + 1. While our approach
can be applied to a very general definition of queries, for
simplicity we first focus on queries that estimate the mean
of a function φ: X → [0, 1] on the unknown distribution P, that is, P[φ] = E_{x∼P}[φ(x)]. The estimate is required to be correct up to some additive error τ, usually referred to as the tolerance,
with high probability. Such queries allow the analyst to learn
a variety of basic statistics of the population, for example,
the fraction of the population over six feet tall. More generally, they allow the analyst to estimate the true means and
moments of individual attributes, correlations between
attributes, and the accuracy of any predictive model. Such
queries are referred to as statistical in the context of the well-studied statistical query model19 and have also been studied
in statistics as linear statistical functionals. It is known that
many standard data analyses can be performed using access
to statistical queries instead of direct access to data (see Refs. 4, 19 for examples).
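As a concrete illustration, the following is a minimal sketch of answering a single, non-adaptively chosen statistical query by its empirical average; the function name answer_sq and the synthetic height distribution standing in for the unknown P are illustrative assumptions, not from the paper. By Hoeffding's inequality, such an estimate is within tolerance τ of P[φ] with probability at least 1 − 2exp(−2nτ²).

```python
import numpy as np

def answer_sq(sample, phi):
    """Answer a statistical query phi: X -> [0, 1] by its empirical mean."""
    return float(np.mean([phi(x) for x in sample]))

# Synthetic stand-in for the unknown distribution P: heights in inches.
rng = np.random.default_rng(0)
sample = rng.normal(loc=69.0, scale=3.0, size=10_000)   # n = 10,000 i.i.d. draws

# Query: fraction of the population over six feet (72 inches) tall.
phi = lambda height: float(height > 72.0)
estimate = answer_sq(sample, phi)

# Hoeffding: |estimate - P[phi]| <= tau with probability >= 1 - 2*exp(-2*n*tau^2);
# for this synthetic Gaussian the true value is P(Z > 1), about 0.159.
print(f"estimate = {estimate:.3f}")
```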
Even in this relatively simple setting, the question of
how many adaptively chosen statistical queries can be correctly answered using n samples drawn i.i.d. from P has
not been previously examined. The conservative approach
of using fresh samples for each adaptively chosen query
requires n to scale linearly with the number of queries m.
We observe that such a bad dependence is inherent in the
standard approach of estimating expectations by the exact
empirical average on the samples. This is directly implied
by “Freedman’s paradox”13 and we describe an additional simple example in Ref. 9. This situation is in stark contrast to the nonadaptive case, in which n = O((log m)/τ²) samples suffice to answer m queries with a tolerance τ using empirical averages.
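The small simulation below is a minimal sketch of this failure mode (it is not the example described in Ref. 9, and all parameter choices are illustrative). The data are pure noise, so no predictor can truly exceed 50% accuracy, yet after a single adaptive round of selecting attributes by their exact empirical correlations, the empirical accuracy of the aggregated predictor on the same data looks substantially better than chance.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 2_000, 1_000
X = rng.choice([-1.0, 1.0], size=(n, d))   # attributes: pure noise
y = rng.choice([-1.0, 1.0], size=n)        # label, independent of the attributes

# Round 1 of queries: empirical correlation of each attribute with the label.
corr = X.T @ y / n

# Adaptive step: keep attributes whose empirical correlation looks "significant"
# and aggregate them with their empirical signs.
selected = np.abs(corr) > 2.0 / np.sqrt(n)
weights = np.sign(corr) * selected

# Round 2 query: empirical accuracy of the aggregate predictor on the SAME data.
predictions = np.sign(X @ weights)
empirical_accuracy = np.mean(predictions == y)

print(f"selected {int(selected.sum())} of {d} attributes")
print(f"empirical accuracy = {empirical_accuracy:.3f} (true accuracy is 0.5)")
```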
We demonstrate that the problem can be addressed using
techniques developed in the context of differential privacy,
a definition of privacy tailored to privacy-preserving data
analysis. Roughly speaking, differential privacy ensures that
the probability of observing any outcome from an analysis
is “essentially unchanged” by modifying any single dataset
element (the probability distribution is over the randomness introduced by the algorithm).
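For intuition, the sketch below shows the Laplace mechanism, a standard way to achieve differential privacy for a single mean query; it only illustrates the definition and is not the algorithm introduced in this work. Changing one element of a dataset of n values in [0, 1] moves the true mean by at most 1/n, so Laplace noise of scale 1/(nε) ensures that the probability of any output changes by a factor of at most e^ε between neighboring datasets.

```python
import numpy as np

def private_mean(data, epsilon, rng):
    """epsilon-differentially private mean of values in [0, 1].

    The mean has sensitivity 1/n, so Laplace noise of scale 1/(n*epsilon)
    changes the output distribution by a factor of at most exp(epsilon)
    when any single element of the dataset is modified.
    """
    n = len(data)
    return float(np.mean(data) + rng.laplace(scale=1.0 / (n * epsilon)))

rng = np.random.default_rng(2)
data = rng.random(1_000)            # a dataset of n = 1,000 values in [0, 1]
neighbor = data.copy()
neighbor[0] = 1.0 - neighbor[0]     # differs from `data` in a single element

# The two outputs are statistically almost indistinguishable, yet each is
# still an accurate estimate of the mean of its dataset.
print(private_mean(data, epsilon=0.5, rng=rng))
print(private_mean(neighbor, epsilon=0.5, rng=rng))
```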
The central insight of differentially private data analysis is that it is possible to learn statistical properties of a dataset while controlling the amount of information revealed
about any dataset element. Our approach is based on the
same view of the adaptive data reuse problem: the analyst
can be prevented from overfitting to the data if the amount
of information about the data revealed to the analyst is limited. To ensure that information leakage is limited, the algorithm needs to control the access of the analyst to the data.
We show that this view can be made formal by introducing
the notion of maximum information between two random
variables. This notion allows us to bound the factor by which
uncertainty about the dataset is reduced given the output of
the algorithm on this dataset. We describe it in more detail
in Section 3.1.
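To make the quantity concrete, the toy computation below is a minimal sketch (the formal definition appears in Section 3.1; the discrete example and its parameters are illustrative assumptions). It evaluates the largest factor by which observing an algorithm's output can increase the probability of any particular dataset.

```python
import numpy as np
from itertools import product

# Toy setting: the "dataset" S is 3 fair coin flips; the "output" O is the
# majority bit, flipped with probability 0.25 (a randomized-response style
# algorithm).
datasets = list(product([0, 1], repeat=3))
p_s = 1.0 / len(datasets)                      # uniform prior over datasets

def p_o_given_s(o, s):
    majority = int(sum(s) >= 2)
    return 0.75 if o == majority else 0.25

p_o = {o: sum(p_o_given_s(o, s) * p_s for s in datasets) for o in (0, 1)}

# Max-information (in bits): log2 of the largest factor by which observing
# O = o can increase the probability of any particular dataset S = s,
# i.e. the maximum over (s, o) of P[S=s | O=o] / P[S=s] = P[O=o | S=s] / P[O=o].
ratios = [p_o_given_s(o, s) / p_o[o] for s in datasets for o in (0, 1)]
print(f"max-information = {np.log2(max(ratios)):.3f} bits")   # ~0.585 bits
```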
a Additional averaging over k different partitions is used in k-fold cross-validation.