writing programs that ensure differential privacy.18
k-Means Clustering. As an example of "private programming,"2 consider k-means clustering, described first in its usual, non-private form. The input consists of points p_1, . . . , p_n in the d-dimensional unit cube [0, 1]^d. Initial candidate means m_1, . . . , m_k are chosen randomly from the cube and updated as follows:
1. Partition the samples {p_i} into k sets S_1, . . ., S_k, associating each p_i with the nearest m_j.
2. For 1 ≤ j ≤ k, set
$m_j = \frac{1}{|S_j|}\sum_{p_i \in S_j} p_i,$
the mean of the samples associated with m_j.
This update rule is typically iterated
until some convergence criterion has
been reached, or a fixed number of
iterations have been applied.
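For concreteness, here is a minimal Python sketch of one round of this update rule (the function name kmeans_update, the NumPy usage, and all parameter choices are illustrative); the inline comment marks where a differentially private variant would substitute noisy quantities for the exact ones.

import numpy as np

def kmeans_update(points, means):
    # One iteration of the update rule described above.
    # points: (n, d) array of samples in the unit cube [0, 1]^d
    # means:  (k, d) array of current candidate means m_1, ..., m_k
    dists = np.linalg.norm(points[:, None, :] - means[None, :, :], axis=2)
    nearest = np.argmin(dists, axis=1)     # step 1: index of the closest mean
    new_means = means.copy()
    for j in range(means.shape[0]):
        members = points[nearest == j]     # S_j: samples associated with m_j
        if len(members) > 0:
            # Step 2: replace m_j with the mean of S_j.  A private variant
            # would instead divide a noisy coordinate-wise sum by a noisy
            # count (for example, using Laplace noise) rather than use the
            # exact values.
            new_means[j] = members.mean(axis=0)
    return new_means

rng = np.random.default_rng(0)
points = rng.random((200, 2))              # n = 200 points in [0, 1]^2
means = rng.random((3, 2))                 # k = 3 random initial candidate means
for _ in range(10):                        # a fixed number of iterations
    means = kmeans_update(points, means)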
Generating Synthetic Data
The idea of creating a synthetic dataset whose statistics closely mirror those of the original dataset, but which preserves the privacy of individuals, was proposed in the statistics community no later than 1993.21 The lower bounds on noise discussed at the end of the section on "How Is Hard" imply that no such dataset can safely provide very accurate answers to too many weighted subset sum questions, motivating the interactive approach to private data analysis discussed herein. Intuitively, the
advantage of the interactive approach
is that only the questions actually
asked receive responses.
Against this backdrop, the non-interactive case was revisited from a
learning theory perspective, challenging the interpretation of the noise
lower bounds as a limit on the number
of queries that can be answered privately.3 This work, described next, has
excited interest in interactive and non-interactive solutions yielding noise in the range between √n and n.
Let X be a universe of data items
and let C be a concept class consisting
of functions c : X → {0, 1}. We say x ∈ X satisfies a concept c ∈ C if and only if c(x)
= 1. A concept class can be extremely
general; for example, it might consist
of all rectangles in the plane, or all
Boolean circuits containing a given
number of gates.
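To make the definition concrete, the following small Python sketch represents one such concept class, axis-aligned rectangles in the plane (the class name Rectangle and its fields are illustrative); a data item x satisfies a concept c exactly when c(x) returns 1.

from dataclasses import dataclass

@dataclass(frozen=True)
class Rectangle:
    # A concept c : [0, 1]^2 -> {0, 1} given by an axis-aligned rectangle.
    x_lo: float
    x_hi: float
    y_lo: float
    y_hi: float

    def __call__(self, point):
        x, y = point
        # c(point) = 1 iff the point lies inside the rectangle
        return int(self.x_lo <= x <= self.x_hi and self.y_lo <= y <= self.y_hi)

c = Rectangle(0.2, 0.6, 0.1, 0.9)
print(c((0.3, 0.5)))   # 1: this point satisfies the concept
print(c((0.8, 0.5)))   # 0: this one does not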
Given a sufficiently large database
D ∈ X^n, it is possible to privately generate a synthetic database that maintains approximately correct fractional
counts for all concepts in C (there may
be infinitely many!). That is, letting
S denote the synthetic database produced, with high probability over the
choices made by the privacy mechanism, for every concept c ∈ C, the fraction of elements in S that satisfy c is
approximately the same as the fraction
of elements in D that satisfy c.
The minimal size of the input database depends on the quality of the approximation, the logarithm of the cardinality of the universe X, the privacy parameter ε, and the Vapnik–Chervonenkis dimension of the concept class C (for finite |C| this is at most log₂ |C|). The synthetic dataset, chosen by the exponential mechanism, will be a set of m = O(VCdim(C)/γ²) elements in X (γ governs the maximum permissible inaccuracy in the fractional count).
Letting D denote the input dataset and D̂ a candidate synthetic dataset, the utility function for the exponential mechanism is given by
$u(D, \hat{D}) = -\max_{c \in C}\left|\frac{|\{x \in \hat{D} : c(x)=1\}|}{m} - \frac{|\{x \in D : c(x)=1\}|}{n}\right|,$
the negated worst-case discrepancy, over all concepts in C, between the fraction of elements of D̂ and of D that satisfy the concept.
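The following Python sketch makes this construction concrete for a toy setting, assuming a tiny finite universe and a handful of threshold concepts so that all candidate synthetic datasets of size m can be enumerated (the names frac, utility, and exponential_mechanism are illustrative). A candidate is sampled with probability proportional to exp(εnu/2), which is appropriate because changing one row of D shifts the utility by at most 1/n; enumerating every candidate is feasible only for toy universes, and the construction makes no claim of computational efficiency.

import itertools
import math
import random

def frac(dataset, c):
    # Fraction of elements of the dataset satisfying concept c.
    return sum(c(x) for x in dataset) / len(dataset)

def utility(D, S, concepts):
    # Negated worst-case discrepancy in fractional counts over the class.
    return -max(abs(frac(S, c) - frac(D, c)) for c in concepts)

def exponential_mechanism(D, universe, concepts, m, epsilon):
    # Score every size-m multiset over the universe and sample one with
    # probability proportional to exp(epsilon * |D| * utility / 2).
    candidates = list(itertools.combinations_with_replacement(universe, m))
    weights = [math.exp(epsilon * len(D) * utility(D, S, concepts) / 2)
               for S in candidates]
    return random.choices(candidates, weights=weights, k=1)[0]

# Toy example: data items in [0, 1], concepts are threshold predicates.
universe = [i / 10 for i in range(11)]
concepts = [lambda x, t=t: int(x <= t) for t in (0.25, 0.5, 0.75)]
D = [0.1, 0.2, 0.2, 0.3, 0.7, 0.8, 0.9, 0.9]
S = exponential_mechanism(D, universe, concepts, m=4, epsilon=1.0)
print(S)   # a synthetic multiset whose fractional counts roughly match D's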
Pan-Privacy
Data collected by a curator for a given
purpose may be subject to “mission
creep” and legal compulsion, such
as a subpoena. Of course, we could
analyze data and then throw it away,
but can we do something even stronger, never storing the data in the first
place? Can we strengthen our notion
of privacy to capture the “never store”
requirement?
These questions suggest an investigation of differentially private streaming algorithms with small state—much
too small to store the data. However,
nothing in the definition of a streaming algorithm, even one with very
small state, precludes storing a few
individual data points. Indeed, popular techniques from the streaming
literature, such as Count-Min Sketch
and subsampling, do precisely this.
In such a situation, a subpoena or
other intrusion into the local state will
breach privacy.
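By contrast, an algorithm can keep state that never contains raw data items at all. The following Python sketch, in the spirit of the pan-private density estimators studied in this line of work (the function name and the particular bias are illustrative), maintains one randomized bit per universe element; an arrival merely re-randomizes that element's bit with a slight bias toward 1, so the stored state for an individual who appeared is distributed almost the same as for one who did not.

import random

def pan_private_density(stream, universe, epsilon):
    # State: one bit per universe element -- never the raw items themselves.
    p_appeared = 0.5 + epsilon / 4       # slight bias used when an item arrives
    bits = {x: (random.random() < 0.5) for x in universe}

    for x in stream:                     # each arrival re-randomizes one bit
        bits[x] = (random.random() < p_appeared)

    frac_ones = sum(bits.values()) / len(universe)
    # E[frac_ones] = 1/2 + density * epsilon/4, so invert that relation.
    estimate = (frac_ones - 0.5) / (epsilon / 4)
    # A complete algorithm would also randomize this output (for example,
    # with Laplace noise) so the published answer itself is private.
    return estimate

print(pan_private_density([3, 7, 7, 3, 9], universe=range(100), epsilon=0.5))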
A pan-private algorithm is private
“inside and out,” remaining differentially private even if its internal state
becomes visible to an adversary.10 To
understand the pan-privacy guarantee, consider click stream data. This
data is generated by individuals, and
an individual may appear many times
in the stream. Pan-privacy requires
that any two streams differing only
in the information of a single individual should produce very similar
distributions on the internal states of
the algorithm and on its outputs, even
though the data of an individual are