classification, and contingency table measurement. These
examples have all been relatively easy adaptations of existing approaches.2,4
2. MATHEMATICAL FOUNDATIONS
We now develop some supporting mathematics for PINQ.
We review the privacy definition we use, differential privacy, and develop several properties necessary to design
a programming language supporting its guarantees.
Specifically, we discuss the data types we can support, common differentially private aggregations, how transformations
of the data sets impact privacy, and how privacy guarantees of
multiple analyses compose. All of our conclusions are immediate consequences of differential privacy, rather than additional assumptions or implementation details. The proofs
are available in the full version of the paper.10
2.1. Differential privacy
Differential privacy is a relatively new privacy definition, building upon the work of Dwork et al.8 and publicly articulated in Dwork.5 It differs from most previous definitions in that it does not attempt to guarantee the prevention of data disclosures, privacy violations, or other bad events; instead, it guarantees that participation in the data set is not their cause.
The definition of differential privacy requires that a randomized computation yield nearly identical distributions
over outcomes when executed on nearly identical input data
sets. Treating the input data sets as multisets of records over an arbitrary domain and using ⊖ for symmetric difference (i.e., A ⊖ B is the set of records in A or B, but not both):

Definition 1. A randomized computation M provides ε-differential privacy if for any two input data sets A and B, and any set of possible outputs S of M,

    Pr[M(A) ∈ S] ≤ Pr[M(B) ∈ S] × exp(ε × |A ⊖ B|).
For values of x much less than one, exp(x) is approximately
1 + x. Differential privacy relies only on the assumption that
the data sets are comprised of records, of any data type, and
is most meaningful when there are few records for each participant, relative to 1/ε.
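To give a numeric sense of the guarantee (the specific values here are our illustration, not taken from the paper): with ε = 0.1 and a participant contributing a single record, so that |A ⊖ B| = 1, the probability of any output event S changes by a factor of at most

    exp(0.1 × 1) ≈ 1.105,

that is, by roughly 10%, in line with the approximation exp(x) ≈ 1 + x.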
The definition is not difficult to motivate to nonexperts.
A potential participant can choose between two inputs to
the computation M: a data set containing their records (A)
and the equivalent data set with their records removed (B).
Their privacy concerns stem from the belief that these two
inputs may lead to noticeably different outcomes for them.
However, differential privacy requires that any output event
(S) is almost as likely to occur with these records as without.
From the point of view of any participant, computations
which provide differential privacy behave almost as if their
records had not been included in the analysis.
Taking a concrete example, consider the sensible concern of most Web search users that their name and search history might appear on the front page of the New York Times.3 For each participant, there is some set S of outputs of M that would prompt the New York Times to this publication; we do not necessarily know what this set S of outputs is, but we need not define S for the privacy guarantees to hold.
For all users, differential privacy ensures that the probability the New York Times publishes their name and search history is barely more than it would have been had their data not been included as input to M.
Unless the user has made the queries public in some other
way, we imagine that this is improbable indeed.
2.2. Basic aggregations
The simplest differentially private aggregation (from Dwork et al.8) releases the number of records in a data set, after the addition of symmetric exponential (Laplace) noise, scaled by 1/ε (Figure 2). The Laplace distribution is chosen because it has the property that the probability of any outcome decreases by a factor of exp(ε) with each unit step away from its mean. Translating its mean (shifting the true value) by one unit scales the probability of any output by a multiplicative factor of at most exp(ε). Changing an input data set from A to B can shift the true count by at most |A ⊖ B|, and consequently causes a multiplicative change of at most exp(ε × |A ⊖ B|) in the probability of any outcome.
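To make the factor explicit, write the Laplace(1/ε) density as p(x) = (ε/2) exp(−ε|x|) (the notation here is ours). Translating the mean by one unit then changes the density at any point x by at most

    p(x − 1) / p(x) = exp(ε × (|x| − |x − 1|)) ≤ exp(ε),

since |x| − |x − 1| ≤ 1 for every x, by the triangle inequality.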
Theorem 1. The mechanism M(X) = |X| + Laplace(1/ε) provides ε-differential privacy.
The Laplace distribution has exponential tails in both
directions, and the probability that the error exceeds t/ε in
either direction is exponentially small in t. Consequently,
the released counts are likely to be close to the true counts.
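To make Theorem 1 concrete, here is a minimal sketch of the noisy-count mechanism in Python (PINQ itself is written in C#; the function name noisy_count and the sampling method are our illustration, not part of any published API):

    import random

    def noisy_count(records, epsilon):
        # Theorem 1: release |X| + Laplace(1/epsilon) noise.
        # The difference of two independent Exponential(epsilon)
        # draws is Laplace-distributed with scale 1/epsilon.
        noise = random.expovariate(epsilon) - random.expovariate(epsilon)
        return len(records) + noise

    # With epsilon = 0.1, the reported count is within t/epsilon = 30
    # of the truth except with probability exp(-t) = exp(-3), about 5%.
    print(noisy_count(range(1000), 0.1))

The accuracy claim in the comment follows directly from the Laplace tail bound: Pr[|Laplace(1/ε)| > t/ε] = exp(−t).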
Other primitive aggregations: There are many other mechanisms that provide differential privacy; papers on the subject typically contain several. To date, each has had its privacy guarantee established as above, by a written mathematical proof based on
intended behavior. While this is clearly an important step in
developing such a computation, the guarantees are only as
convincing as the proof is accessible and the implementation is correct.
Our goal is to enable the creation of as many differentially
private computations as possible using only a few primitive
components, whose mathematical properties and implementations can be publicly scrutinized and possibly verified.
While we shouldn’t preclude the introduction of novel primitives, they should be the exceptional, rather than default,
approach to designing new differentially private algorithms.
2.3. Stable transformations
Rather than restrict analysts to a fixed set of aggregations, we intend to supply them with a programming language they can use to describe new and unforeseen computations. Most of the power of PINQ lies in arming the
Figure 2. Adding symmetric exponential noise to counts causes the probability of any output (or set of outputs) to increase or decrease by at most a multiplicative factor when the counts are translated.