However, the theorem also shows we can sometimes do better. The logical-AND count we discussed earlier, even though it involves two different bits in each row, still only has sensitivity 1: the number of 2-bit rows whose entries are both 1 can change by at most 1 with the addition or deletion of a single row. Thus, this more complicated query can be answered in an ε-differentially private fashion using noise distributed as Lap(1/ε); we do not need to use the distribution Lap(2/ε).
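As a minimal sketch in Python (the helper names and the toy database here are ours, for illustration only), the AND count can be released by adding a single draw of Lap(1/ε) noise, since the query's sensitivity is 1:

    import numpy as np

    def and_count(rows):
        # Number of 2-bit rows whose entries are both 1.
        return sum(1 for (b1, b2) in rows if b1 == 1 and b2 == 1)

    def laplace_release(true_value, sensitivity, epsilon,
                        rng=np.random.default_rng()):
        # Add Lap(sensitivity/epsilon) noise to the true answer.
        return true_value + rng.laplace(scale=sensitivity / epsilon)

    db = [(1, 1), (0, 1), (1, 0), (1, 1)]
    noisy_answer = laplace_release(and_count(db), sensitivity=1.0, epsilon=0.5)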
Histogram Queries. The power of Theorem 2 really becomes clear when considering histogram queries, defined as follows. If we think of the rows of the database as elements in a universe X, then a histogram query is a partitioning of X into an arbitrary number of disjoint regions X1, X2, . . . , Xd. The implicit question posed by the query is: “For i = 1, 2, . . . , d, how many points in the database are contained in Xi?” For example, the database may contain the annual income for each respondent, and the query is a partitioning of incomes into ranges: {[0, 50K), [50K, 100K), . . . , ≥ 500K}. In this case d = 11, and the question is asking, for each of the d ranges, how many respondents in the database have annual income in the given range. This looks like d separate counting queries, but the entire query actually has sensitivity Δf = 1. To see this, note that if we remove one row from the database, then only one cell in the histogram changes, and that cell only changes by 1; similarly for adding a single row. So Theorem 2 says that ε-differential privacy can be maintained by perturbing each cell with an independent random draw from Lap(1/ε). Returning to our example of 2-bit rows, we can pose the 4-ary histogram query requesting, for each pair of literals v1v2, the number of rows with value v1v2, adding noise of order 1/ε to each of the four cells.
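A sketch of this 4-ary histogram release (again with illustrative names of our own choosing): because the whole histogram has sensitivity 1, each of the four cells gets its own independent Lap(1/ε) draw, with no extra scaling for the number of cells:

    import numpy as np

    def noisy_2bit_histogram(rows, epsilon, rng=np.random.default_rng()):
        # Exact counts for the four cells 00, 01, 10, 11.
        counts = {(b1, b2): 0 for b1 in (0, 1) for b2 in (0, 1)}
        for row in rows:
            counts[row] += 1
        # One independent Lap(1/epsilon) draw per cell; the entire
        # histogram query has sensitivity 1.
        return {cell: c + rng.laplace(scale=1.0 / epsilon)
                for cell, c in counts.items()}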
When Noise Makes No Sense. There are times when the addition of noise for achieving privacy makes no sense. For example, the function f might map databases to strings, strategies, or trees, or it might be choosing the “best” among some specific, not necessarily continuous, set of real-valued objects. The problem of optimizing the
output of such a function while preserving ε-differential privacy requires additional technology.
Assume the curator holds a database D and the goal is to produce an object y. The exponential mechanism19 works as follows. We assume the existence of a utility function u(D, y) that measures the quality of an output y, given that the database is D. For example, the data may be a set of labeled points in R^d and the output y might be a d-ary vector describing a (d − 1)-dimensional hyperplane that attempts to classify the points, so that those labeled with +1 have non-negative inner product with y and those labeled with −1 have negative inner product. In this case the utility would be the number of points correctly classified, so that higher utility corresponds to a better classifier. The exponential mechanism outputs y with probability proportional to exp(u(D, y)ε/Δu) and ensures ε-differential privacy. Here Δu is the sensitivity of the utility function, bounding, for all databases (D, D′) differing in only one row and all potential outputs y, the difference |u(D, y) − u(D′, y)|. In our example, Δu = 1. The mechanism assigns most mass to the best classifier, and the mass assigned to any other drops off exponentially in the decline in its utility for the current dataset; hence the name “exponential mechanism.”
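For a finite candidate set, the sampling step can be sketched as follows (an illustrative implementation under our own naming, following the exp(u(D, y)ε/Δu) weighting described above):

    import numpy as np

    def exponential_mechanism(D, candidates, utility, epsilon, delta_u,
                              rng=np.random.default_rng()):
        # Score every candidate output y under the current database D.
        scores = np.array([utility(D, y) for y in candidates], dtype=float)
        scores -= scores.max()  # stabilizes exp(); the shift cancels below
        # Weight proportional to exp(u(D, y) * epsilon / delta_u).
        weights = np.exp(scores * epsilon / delta_u)
        weights /= weights.sum()
        return candidates[rng.choice(len(candidates), p=weights)]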
When Sensitivity Is Hard to Analyze. The Laplace and exponential mechanisms provide a differentially private interface through which the analyst can access the data. Such an interface can be useful even when it is difficult to determine the sensitivity of the desired function or query sequence; it can also be used to run an iterative algorithm, composed of easily analyzed steps, for as many iterations as a given privacy budget permits.
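As a sketch of such an interface (the class and its budget accounting are our illustration, not a published API), each noisy sum query spends part of a fixed privacy budget under sequential composition:

    import numpy as np

    class NoisySumInterface:
        # Answers sum queries with Laplace noise until the privacy
        # budget is exhausted (illustrative; assumes each per-row
        # function f maps a row into [0, bound]).
        def __init__(self, data, total_budget):
            self.data = data
            self.remaining = total_budget
            self.rng = np.random.default_rng()

        def noisy_sum(self, f, epsilon, bound=1.0):
            if epsilon > self.remaining:
                raise RuntimeError("privacy budget exhausted")
            self.remaining -= epsilon  # sequential composition
            true_sum = sum(f(row) for row in self.data)
            return true_sum + self.rng.laplace(scale=bound / epsilon)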
This is a powerful observation; for example, using only noisy sum queries, it is possible to carry out many standard data mining tasks, such as singular value decompositions, finding an ID3 decision tree, clustering, learning association rules, and learning anything learnable in the statistical queries learning model, frequently with good accuracy, in a privacy-preserving fashion.2 This approach has been generalized to yield a publicly available codebase for