// Section 3. 4 explains this, and why it is needed
keyFunc = Purify(keyFunc) as Expression<Func<T,K>>;
// new agent with appropriate ancestor and stability
var newagent = new PINQAgentUnary( this.agent, 2.0);
// new data source reflecting the operation
var newsource = this.source.GroupBy(keyFunc);
// construct and return a new source and agent pair
return new PINQueryable<IGrouping<K,T>>(newsource,
The Join transformation is our main deviation from
LINQ. To ensure stability one with respect to each input,
we only report pairs that are the result of unique key
matches. To ensure these semantics, PINQ’s Join invokes
LINQ’s GroupBy on each input, using their key selection
functions. Groups with more than one element are discarded, and the resulting singleton elements are joined
using LINQ’s Join.
While this clearly (and intentionally) interferes with
standard uses of Join, analysts can reproduce its standard
behavior by first invoking GroupBy on each data set, ensuring that there is at most one record per group, before invoking Join. The difference is that the Join is now required
to reduce pairs of groups, rather than pairs of records.
Each pair of groups yields a single result, rather than the
unbounded cartesian product of the two, constraining the
output but enabling privacy guarantees.
3. 3. the partition operator
Theorem 4 tells us that structurally disjoint queries cost
only the maximum privacy differential, and we would like
to expose this functionality to the analyst. To that end, we
introduce a Partition operation, like GroupBy, but in
which the analyst must explicitly provide a set of candidate
keys. The analyst is rewarded with a set of PINQueryable
objects, one for each candidate key, containing the (
possibly empty) subset of records that map to the each of the
associated keys. It is important that PINQ does not reveal
the set of keys present in the actual data, as this would violate differential privacy. For this reason, the analyst must
specify the keys of interest, and PINQ must not correct
them. Some subsets may be empty, and some records may
not be reflected in any subset.
The PINQAgent objects of these new PINQueryable
objects all reference the same source PINQAgent, of the
source data, but following Theorem 4 will alert the agent
only to changes in the maximum value of epsilon. The
agents share a vector of their accumulated epsilon values
since construction, and consult this vector with each update
to see if the maximum has increased. If so, they forward the
change in maximum. If the maximum has not increased,
they accept the request.
Partition in PINQ can be seen in the following two
Q1. How many ZIP codes contain at least 10 patients?
Q2. For each ZIP code, how many patients live there?
For Q1, a GroupBy by ZIP, a Where on the number of
patients, and a Count gives an approximate answer to the
exact number of ZIP codes with at least 10 patients. For
Q2, a Partition by ZIP, followed by a Count on each part
returns an approximate count for each ZIP code. As the measurements can be noisy, neither query necessarily provides
a good estimate for the other. However, both are at times
important questions, and PINQ is able to answer either
accurately depending on how the question is posed.
The Partition operator can be followed not only by
aggregation but by further differentially private computation on each of the parts. It enables a powerful recursive
descent programming paradigm demonstrated in Section 4,
and is very important in most nontrivial data analyses.
3. 4. security issues in implementation
Although the stability mathematics, composition properties,
and definition of differential privacy provide mathematical
guarantees, they do so only when PINQ’s behavior is in line
with our mathematical expectations. There are many important but subtle implementation details intended to protect
against clever attackers who might use the implementation details of PINQ to learn information the mathematics
would conceal. These are largely the result of user-defined
code that may attempt to pass information out through side
channels, either directly through disk or network channels,
or indirectly by throwing exceptions or simply not terminating. PINQ’s purify function gives the provider the opportunity to examine incoming methods and rewrite them,
either by restricting the computations to those comprised
of known-safe methods, or by rewriting the methods with
appropriate guards. There are other issues and countermeasures in the full paper, and likely more unrecognized issues
to be discovered and addressed.
4. aPPLiCations anD eVaLuation
In this section, we present data analyses written with
PINQ. Clearly not all analysis tasks can be implemented
in PINQ (indeed, this is the point), but we aim to convince
the reader that the set is sufficiently large as to be broadly
Our main example application is a data visualization
based on Web search logs containing IP addresses and query
text. The application demonstrates many features of PINQ
largely absent from other privacy-preserving data analysis
platforms. These include direct access to unmodified data,
user-supplied record-to-record transformations, operations
such as GroupBy and Join on “sensitive” attributes, multiple independent data sets, and unfettered integration into
For our experiments we use the DryadLINQ15 provider.
DryadLINQ is a research LINQ provider implemented on
top of the Dryad9 middleware for data parallel computation,