Theorem 6. For k = I∞(S; φ), Pr[S ∈ R(φ)] ≤ 2^k ⋅ maxφ Pr[S ∈ R(φ)].
The proof follows easily by first decomposing the event
S ∈ R(φ) into the events {S ∈ R(φ) and φ = φ}, one for each
function φ in the support of φ. To each of these events
we can apply the definition of max-information and obtain
that Pr[S ∈ R(φ) | φ = φ] ≤ 2^k ⋅ Pr[S ∈ R(φ)]. Substituting this
bound back into the decomposition gives the desired claim.
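Written out as a display (our restatement of the proof above; bold φ denotes the random variable, plain φ a fixed function in its support):

\begin{align*}
\Pr[S \in R(\boldsymbol{\varphi})]
 &= \sum_{\varphi} \Pr[\boldsymbol{\varphi} = \varphi]\,\Pr[S \in R(\varphi) \mid \boldsymbol{\varphi} = \varphi]\\
 &\le \sum_{\varphi} \Pr[\boldsymbol{\varphi} = \varphi]\; 2^{k}\,\Pr[S \in R(\varphi)]
 \;\le\; 2^{k} \cdot \max_{\varphi} \Pr[S \in R(\varphi)].
\end{align*}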
Our theorem is completely general in the sense that the
random variable φ does not have to be supported on func-
tions over X and could instead assume values in any other
discrete domain. For example, such output could be a set of
features of the data to be used for a subsequent supervised
learning task. For our main application φ refers to a func-
tion, and we denote the set of datasets on which the empiri-
cal estimator has error greater than τ as Rτ(φ) = {S : |ES[φ] − P[φ]| > τ}.
By Hoeffding’s bound we know that maxφ Pr[S ∈ Rτ(φ)] ≤
exp(−2τ²n). This gives the following immediate corollary.
Corollary 7. If I∞(S; φ) ≤ log₂e ⋅ τ²n, then Pr[S ∈ Rτ(φ)] ≤ exp(−τ²n).
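The exponent in Corollary 7 can be checked directly by plugging the Hoeffding bound into Theorem 6:

\[
\Pr[S \in R_{\tau}(\boldsymbol{\varphi})]
 \;\le\; 2^{\log_2 e \,\cdot\, \tau^2 n} \cdot \exp(-2\tau^2 n)
 \;=\; e^{\tau^2 n} \cdot e^{-2\tau^2 n}
 \;=\; \exp(-\tau^2 n).
\]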
To apply this corollary, we only need to show that pure
differential privacy implies a sufficiently strong bound on
the max-information I∞(S; φ).
Theorem 8. Let M be an ε-differentially private algorithm.
Let S be any random variable over n-element input datasets for
M and let Y be the corresponding output distribution Y = M(S).
Then I∞(S; Y) ≤ log₂e ⋅ εn.
The proof of this theorem follows from observing that
any two n-element datasets S and S′ differ in at most n
elements. Therefore, applying the guarantee of differential
privacy n times (once for each element in which S and S′
differ), we obtain that for every y, Pr[Y = y | S = S] ≤
e^{εn} ⋅ Pr[Y = y | S = S′]. As Pr[Y = y] is an average of
Pr[Y = y | S = S′] over the choice of S′, there must exist a
dataset S′ such that Pr[Y = y | S = S′] ≤ Pr[Y = y]. We can
therefore conclude that for every S and every y it holds
that Pr[Y = y | S = S] ≤ e^{εn} ⋅ Pr[Y = y]. This yields the desired
bound I∞(S; Y) = I∞(Y; S) ≤ log₂e ⋅ εn.
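In display form (the first inequality is the group-privacy property obtained by composing the ε-DP guarantee across the at most n differing elements):

\begin{align*}
\Pr[Y = y \mid S = S]
 &\le e^{\varepsilon n}\,\Pr[Y = y \mid S = S'] && \text{(for every $S'$, by $n$-fold $\varepsilon$-DP)}\\
 &\le e^{\varepsilon n}\,\Pr[Y = y] = 2^{\log_2 e \,\cdot\, \varepsilon n}\,\Pr[Y = y] && \text{(for an $S'$ with $\Pr[Y = y \mid S = S'] \le \Pr[Y = y]$)}.
\end{align*}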
To obtain our results, we need an algorithm that satisfies two properties:
1. The algorithm can answer every query φ with a value
that is close (up to error α) to the empirical average of
φ on the dataset.
2. The algorithm is differentially private.
The problem of providing accurate answers to a large number of queries for the average value of a function on the
dataset has been the subject of intense investigation in
the differential privacy literature. Such queries are usually
referred to as (fractional) counting queries or linear queries in
this context. This allows us to obtain statistical query answering algorithms by using various known differentially private
algorithms for answering counting queries. Specifically, our
Theorem 1 relies on the algorithm in Ref. 15, which uses the multiplicative weights update method to answer the queries.
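As an illustration (this is not the algorithm of Ref. 15, which is more careful about when privacy budget is spent), here is a minimal Python sketch of the core multiplicative weights idea: maintain a synthetic histogram over the domain, answer a query from it when it already agrees with a noisy empirical answer, and otherwise reweight the histogram toward the data. All names, constants, and the thresholding rule are illustrative.

import numpy as np

def mw_answer_queries(data_hist, queries, eps0=0.05, eta=0.1, threshold=0.2, seed=0):
    """data_hist: empirical distribution of the dataset over a finite domain
    (a nonnegative vector summing to 1). queries: iterable of {0,1} vectors
    over the same domain (counting queries). Returns one answer per query."""
    rng = np.random.default_rng(seed)
    h = np.full(len(data_hist), 1.0 / len(data_hist))  # synthetic histogram, initially uniform
    answers = []
    for q in queries:
        q = np.asarray(q, dtype=float)
        noisy_ans = float(q @ data_hist) + rng.laplace(0.0, 1.0 / eps0)  # privatized empirical answer
        syn_ans = float(q @ h)
        if abs(noisy_ans - syn_ans) <= threshold:
            # "lazy" round: the synthetic histogram already answers this query well
            answers.append(syn_ans)
        else:
            # update round: multiplicatively reweight h toward the data on q;
            # bounding the number of such rounds is what makes the privacy
            # cost grow slowly with the number of queries
            sign = 1.0 if noisy_ans > syn_ans else -1.0
            h *= np.exp(eta * sign * q)
            h /= h.sum()
            answers.append(float(q @ h))
    return answers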
Our Theorem 2 relies on the basic Laplace noise mechanism
and the strong composition properties of differential privacy.
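For concreteness, here is a minimal sketch of the basic Laplace noise mechanism for a statistical query φ: X → [0, 1] (our illustration, with hypothetical names): the empirical average has sensitivity 1/n, so Laplace noise of scale 1/(nε₀) makes a single answer ε₀-differentially private, and m answers are (mε₀)-differentially private under basic composition; strong composition improves this to roughly √m ⋅ ε₀ and is omitted here.

import numpy as np

def laplace_answer(sample, phi, eps0, rng):
    """Answer a statistical query phi: X -> [0, 1] on `sample` with
    eps0-differential privacy: changing one of the n points moves the
    empirical average by at most 1/n, so Laplace noise of scale
    1/(n * eps0) suffices."""
    n = len(sample)
    empirical = float(np.mean([phi(x) for x in sample]))
    return empirical + rng.laplace(0.0, 1.0 / (n * eps0))

# Usage: each call costs eps0; m adaptively chosen queries compose
# to (m * eps0)-differential privacy under basic composition.
rng = np.random.default_rng(0)
sample = rng.normal(size=1000)
print(laplace_answer(sample, lambda x: float(x > 0), eps0=0.01, rng=rng))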
In the resulting algorithm, α should be viewed as bounding the empirical error, ε should be viewed as bounding the
generalization error and τ = α + ε as bounding the total error.
Notice that the standard approach of using empirical averages has the optimal empirical error—it has α = 0. However,
it is not ε-differentially private for any ε and, as we pointed
out earlier, does not provide any guarantee on the generalization error. At the opposite end, an algorithm which
answers queries with a constant, independent of the data,
has optimal generalization error, but horrible empirical
error. Differentially private algorithms for answering counting queries allow us to explicitly trade off empirical error α
with generalization error ε to obtain a strong bound on the
total error τ.
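To make the trade-off concrete, here is an illustrative calculation (ours, not the analysis behind Theorem 2): for the basic Laplace mechanism above with per-query budget ε₀, a Laplace tail bound with a union bound over m queries gives empirical error α ≈ log(m/β)/(nε₀) with probability 1 − β, while basic composition gives ε = mε₀; scanning ε₀ exhibits the trade-off.

import numpy as np

# Illustrative constants only: how the empirical error alpha and the
# privacy/generalization parameter eps move in opposite directions as
# the per-query budget eps0 varies, and where their sum tau is smallest.
n, m, beta = 100_000, 50, 0.05
eps0_grid = np.logspace(-5, -1, 200)
alpha = np.log(m / beta) / (n * eps0_grid)  # empirical error (Laplace tail + union bound)
eps = m * eps0_grid                         # generalization error (basic composition)
tau = alpha + eps                           # total error
best = np.argmin(tau)
print(f"best eps0 ~ {eps0_grid[best]:.2e}: "
      f"alpha ~ {alpha[best]:.3f}, eps ~ {eps[best]:.3f}, tau ~ {tau[best]:.3f}")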
3.1. Max-information
Intuitively, one way to ensure that the function output by an
algorithm M generalizes is to guarantee that the function
does not depend too much on the input dataset S. We demonstrate that this intuition can be captured via the notion of
max-information that we introduce.
Definition 5. Let X and Y be jointly distributed random
variables. The max-information between X and Y, denoted
I∞(X; Y), is the minimal value of k such that for every x in the
support of X and y in the support of Y we have Pr[X = x | Y = y]
≤ 2^k ⋅ Pr[X = x].
It follows immediately from Bayes’ rule that I∞(X; Y) = I∞(Y; X).
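As a toy illustration of the definition (ours, not from the paper), max-information of a finite joint distribution can be computed directly; the symmetry noted above is visible in the formula, because Pr[X = x | Y = y]/Pr[X = x] = Pr[X = x, Y = y]/(Pr[X = x] ⋅ Pr[Y = y]):

import numpy as np

def max_information(joint):
    """Max-information I_inf(X; Y) in bits for a finite joint distribution,
    where joint[i, j] = Pr[X = i, Y = j]. By the definition above, it is the
    smallest k with Pr[x | y] <= 2^k * Pr[x], i.e. the maximum over the
    support of log2(Pr[x, y] / (Pr[x] * Pr[y])) -- manifestly symmetric."""
    px = joint.sum(axis=1, keepdims=True)   # marginal of X, shape (nx, 1)
    py = joint.sum(axis=0, keepdims=True)   # marginal of Y, shape (1, ny)
    support = joint > 0
    ratio = joint[support] / (px @ py)[support]
    return float(np.max(np.log2(ratio)))

# Independent variables have max-information 0; a perfectly correlated
# pair of fair bits has max-information 1.
print(max_information(np.array([[0.25, 0.25], [0.25, 0.25]])))  # 0.0
print(max_information(np.array([[0.5, 0.0], [0.0, 0.5]])))      # 1.0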
In our use (X, Y) is going to be a joint distribution (S, φ) on
(dataset, function) pairs. The dataset S is drawn from the distribution P^n that corresponds to n points drawn i.i.d. from P.
Random variable φ represents the function generated by
the analyst while interacting with S through our mechanism. Importantly, the analyst may arrive at the function
after observing the evaluations of other functions on the
same dataset S. Now with each possible function φ in the
support of φ we associate a set of “bad” datasets R(φ). We
later choose R(φ) to mean that the empirical value ES[φ] is far
from the true value P[φ], that is, that φ overfits to S. Max-information gives a bound on the probability that S falls in R(φ).