The goal of Thresholdout is to provide an estimate of P[φ]. To achieve this, the algorithm gives an estimate of E_Sh[φ], the empirical mean of φ on the holdout set, in a way that prevents overfitting of functions generated by the analyst to the holdout set. In other words, responses of Thresholdout are designed to ensure that, with high probability, E_Sh[φ] is close to P[φ] and hence an estimate of E_Sh[φ] gives an estimate of the true expectation P[φ]. Given a function φ, Thresholdout first
checks if the difference between the average value of φ on
the training set St (or E_St[φ]) and the average value of φ on the holdout set Sh (or E_Sh[φ]) is below a certain threshold T + η.
Here, T is a fixed number such as 0.01 and η is a Laplace
noise variable whose standard deviation needs to be chosen
depending on the desired guarantees. If the difference is
below the threshold, then the algorithm returns E_St[φ]. If the difference is above the threshold, then the algorithm returns E_Sh[φ] + ξ for another Laplace noise variable ξ. Each time the difference is above the threshold, the “overfitting” budget B
is reduced by one. Once it is exhausted, Thresholdout stops
answering queries. In Figure 1, we provide the pseudocode
of Thresholdout.
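For readers who want to experiment with this logic, the following is a minimal Python sketch of the certify-or-correct rule just described. It is not a transcription of the Figure 1 pseudocode: the class name, the default values of T, σ, and B, and the exact Laplace noise scales are illustrative assumptions; the actual algorithm fixes these quantities according to the guarantees stated in Theorem 9 below.

```python
import numpy as np

class ThresholdoutSketch:
    """Illustrative sketch of the reusable holdout's certify-or-correct rule."""

    def __init__(self, train, holdout, threshold=0.01, sigma=0.005, budget=100, seed=0):
        self.train = np.asarray(train)      # S_t: the analyst also sees this directly
        self.holdout = np.asarray(holdout)  # S_h: accessed only through query()
        self.T = threshold                  # fixed threshold T (e.g., 0.01)
        self.sigma = sigma                  # Laplace noise scale, tuned to the desired guarantees
        self.B = budget                     # "overfitting" budget B
        self.rng = np.random.default_rng(seed)

    def query(self, phi):
        """Return an estimate of P[phi] for phi mapping a data point to [0, 1]."""
        if self.B <= 0:
            return None                                        # budget exhausted: stop answering
        mean_train = float(np.mean([phi(x) for x in self.train]))      # E_St[phi]
        mean_holdout = float(np.mean([phi(x) for x in self.holdout]))  # E_Sh[phi]
        eta = self.rng.laplace(scale=self.sigma)               # per-query threshold noise
        if abs(mean_train - mean_holdout) <= self.T + eta:
            return mean_train                                  # certify the training-set estimate
        self.B -= 1                                            # query overfits: spend one unit of budget
        xi = self.rng.laplace(scale=self.sigma)                # fresh noise for the released value
        return mean_holdout + xi                               # noisy holdout estimate

# Example: estimate the mean of the first coordinate on synthetic data.
rng = np.random.default_rng(1)
validator = ThresholdoutSketch(rng.uniform(size=(1000, 5)), rng.uniform(size=(1000, 5)))
print(validator.query(lambda x: x[0]))
```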
We now state the formal generalization guarantees that
the entire execution of Thresholdout enjoys. They are based
on the privacy guarantees of the “Sparse Vector” technique
given in Chapter 3 of Ref. 12 and the generalization properties of differential privacy. For pure differential privacy we rely on Theorem 4 and for (ε, δ)-differential privacy we use the bound in Ref. 21.
Theorem 9. Let β, τ > 0 and m ≥ B > 0. We set T = 3τ/4 and
σ = τ/(96 ln(4m/β)). Let S denote a holdout dataset of size n
drawn i.i.d. from a distribution P and St be any additional dataset over X. Consider an algorithm that is given access to St and
adaptively chooses functions φ1, . . ., φm while interacting with
Thresholdout which is given datasets S, St and values σ, B, T.
For every i ∈ [m], let ai denote the answer of Thresholdout
on function φi : X → [0, 1]. Further, for every i ∈ [m], we define
the counter of overfitting Zi = |{j ≤ i : |E_St[φj] − P[φj]| > τ/2}|, that is, the number of functions asked so far whose empirical mean on the training set deviates from the true expectation by more than τ/2. Then
Pr[∃i ∈ [m] : Zi ≤ B and |ai − P[φi]| ≥ τ] ≤ β
whenever n ≥ n0, where, up to factors logarithmic in 1/(τβ), n0 = O(√B · ln(m/β)/τ²).
Note that in the bound on n, the term ln(m/β)/τ² is equal
(up to a constant factor) to the number of samples that
are necessary to answer m nonadaptively chosen queries
with tolerance τ and confidence 1 − β. Further, this bound
allows m to be exponentially large in n as long as B
grows subquadratically in n (that is, B ≤ n^(2−c) for a constant c > 0).
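For intuition about the ln(m/β)/τ² term, the short calculation below (a sketch with arbitrary illustrative values of τ and β, constant factors dropped) shows how slowly the nonadaptive sample requirement grows with the number of queries m.

```python
import math

def nonadaptive_samples(m, tau, beta):
    # Hoeffding bound plus a union bound over m fixed queries: n on the order of
    # ln(m / beta) / tau^2 samples suffice for every empirical mean to be within
    # tau of its expectation with probability at least 1 - beta.
    return math.log(m / beta) / tau ** 2

for m in (10 ** 3, 10 ** 6, 10 ** 9):
    print(f"m = {m:>13,}:  n on the order of {nonadaptive_samples(m, tau=0.05, beta=0.05):,.0f}")
```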
We remark that the same approach also works for the
class of low-sensitivity queries. In Ref. 10, we also give an incomparable version of this algorithm with guarantees that derive from description length arguments rather than from differential privacy.
From Theorem 8 and Corollary 7, we see that ensuring
τ²-differential privacy over the entire interaction with the
dataset strictly controls the probability that the adversary
can choose a function that overfits to the dataset. This
is somewhat worse than the claim in Theorem 4 which
requires τ-differential privacy. In Ref. 10, we show that
by considering a simple relaxation of max-information,
referred to as approximate max-information, it is possible
to prove the stronger bound on max-information of differentially private algorithms for datasets consisting of i.i.d. samples. Interestingly, it is not hard to show that algorithms whose output has short description length (in bits) also have low approximate max-information. Thus approximate max-information unifies generalization bounds obtained via (pure) differential privacy and description length. In addition, composition properties of approximate max-information imply that one can easily obtain generalization guarantees for adaptive sequences of algorithms, some of which are differentially private, and others of which have outputs with short description length.
4. THE REUSABLE HOLDOUT
In this section, we describe a practical application of our
framework, which gives a method for safely reusing a holdout set many times. In this application, the analyst splits
the dataset into a training set and a holdout set. An advantage of this approach is that the data analyst will have full,
unrestricted access to the training set and can use it in any
way that she desires. The holdout set can only be accessed
through a reusable holdout algorithm. The goal of this algorithm is to validate the results of analyses performed on the
training set.
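A hypothetical sketch of this workflow (the split ratio and the validator object are illustrative assumptions; any reusable-holdout implementation, such as the ThresholdoutSketch shown above, could play the latter role):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.uniform(size=(10_000, 20))            # stand-in for the analyst's dataset

# Split once: the training half is handed to the analyst directly, the holdout
# half is hidden behind the reusable-holdout validator and never read again.
idx = rng.permutation(len(data))
train, holdout = data[idx[:5_000]], data[idx[5_000:]]

# validator = ThresholdoutSketch(train, holdout)  # all holdout access goes through query()
```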
We describe a specific instantiation of reusable holdout, referred to as Thresholdout, that validates the values
of statistical queries and is based on the “Sparse Vector”
technique from differential privacy (e.g., Chapter 3 of Ref. 12). Specifically, for every function φ : X → [0, 1] given
by the analyst, the algorithm checks if the empirical average of φ on the training set is close to the true mean of φ
(up to some tolerance τ). If the values are close, the algorithm does not provide any additional information to
the analyst. Only if φ overfits the training set does the algorithm provide a valid estimate of the true expectation
of φ. The result is that for all of the queries that the analyst
asks, she has correct estimates of the true expectation—
either our algorithm certifies that the estimate from the
training set is approximately correct or else it provides a
correct estimate using the holdout set. The analysis of the
algorithm shows that the number of samples needed by
Thresholdout depends only logarithmically on the total
number of queries asked by the data analyst as long as the
total number of queries that overfit the training set (and
have to be answered using the holdout set) is not too large.
As a result, this simple and computationally efficient algorithm can potentially answer an exponential (in n) number
of queries.
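As a self-contained toy illustration of this point, the snippet below asks 1,000 statistical queries of fresh i.i.d. data and counts how many of them would have to be answered from the holdout set; queries that do not overfit the training set essentially never consume budget, which is why the dependence on the total number of queries can be made logarithmic. The data model, query family, and parameter values are arbitrary assumptions, and the certify-or-correct check is inlined so the example runs on its own.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 2_000, 50
train = rng.uniform(size=(n, d))        # S_t
holdout = rng.uniform(size=(n, d))      # S_h
T, sigma = 0.05, 0.005                  # illustrative threshold and Laplace noise scale

budget_used = 0
for j in range(1_000):                  # 1,000 queries phi_j(x) = x[j mod d]
    col = j % d
    gap = abs(train[:, col].mean() - holdout[:, col].mean())
    if gap > T + rng.laplace(scale=sigma):
        budget_used += 1                # this query would be answered from the holdout
print("queries that consumed budget:", budget_used)   # typically 0 for these settings
```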
More formally, Thresholdout is given access to the training dataset St and holdout dataset Sh and a budget limit B.
It allows any query of the form φ : X → [0, 1] and its goal is to provide an estimate of P[φ].