In input perturbation, either the data
or the queries are modified before a
response is generated. This broad category encompasses a generalization of
subsampling, in which the curator first
chooses, based on a secret, random
function of the query, a subsample
from the database, and then returns the
result obtained by applying the query to
the subsample. 4 A nice feature of this
approach is that repeating the same
query yields the same answer, while
semantically equivalent but syntactically
different queries are made on essentially unrelated subsamples. However,
an outlier may only be protected by the
unlikelihood of being in the subsample.
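A minimal sketch of such a query-dependent subsampling curator follows; the class name, the keyed-hash seeding, and the parameters secret_key and sample_fraction are invented for illustration, not a construction specified in the text. The point it shows: the subsample is a deterministic function of the secret and the query text, so re-issuing the same query reproduces the same answer, while a differently phrased query lands on an essentially unrelated subsample.

```python
import hashlib
import random

class SubsamplingCurator:
    """Illustrative sketch of a query-dependent subsampling curator.
    Names and parameters are hypothetical; no concrete scheme is specified
    in the text."""

    def __init__(self, database, secret_key, sample_fraction=0.1):
        self.database = database            # list of records
        self.secret_key = secret_key        # secret known only to the curator
        self.sample_fraction = sample_fraction

    def _subsample(self, query_text):
        # Seed a PRNG with a keyed hash of the query text: re-issuing the same
        # query reproduces the same subsample, while a syntactically different
        # (even if semantically equivalent) query yields an unrelated one.
        digest = hashlib.sha256((self.secret_key + query_text).encode()).digest()
        rng = random.Random(digest)
        k = max(1, int(self.sample_fraction * len(self.database)))
        return rng.sample(self.database, k)

    def answer(self, query_text, query_fn):
        # Apply the query to the subsample rather than to the full database.
        return query_fn(self._subsample(query_text))

# Usage: the same query string always yields the same answer.
db = list(range(1000))
curator = SubsamplingCurator(db, secret_key="s3cret")
q = lambda rows: sum(r > 500 for r in rows) / len(rows)
print(curator.answer("fraction over 500", q) == curator.answer("fraction over 500", q))  # True
```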
In what is traditionally called
randomized response, the data itself is
randomized once and for all and statistics are computed from the noisy
responses, taking into account the
distribution on the perturbation. 23 The
term “randomized response” comes
from the practice of having the respondents to a survey flip a coin and, based
on the outcome, answer either an invasive yes/no question or a
more emotionally neutral one. In the
computer science literature the choice
governed by the coin flip is usually
between honestly reporting one’s value
and responding randomly, typically by
flipping a second coin and reporting
the outcome. Randomized response
was devised for the setting in which
the individuals do not trust the curator, so we can think of the randomized
responses as simply being published.
Privacy comes from the uncertainty
of how to interpret a reported value.
The approach becomes untenable for
complex data.
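As a concrete illustration of the coin-flip scheme just described (a minimal sketch, assuming both coins are fair and a single sensitive bit per respondent; the function names and population parameters are invented): each respondent reports truthfully with probability 1/2 and otherwise reports a fresh fair coin, and the analyst inverts the known perturbation to estimate the true fraction of 1's.

```python
import random

def randomized_response(true_bit):
    """One respondent: with probability 1/2 report the true bit, otherwise
    flip a second fair coin and report its outcome."""
    if random.random() < 0.5:
        return true_bit
    return random.randint(0, 1)

def estimate_fraction(reports):
    """Account for the perturbation: E[report] = p/2 + 1/4, so p = 2*mean - 1/2."""
    mean = sum(reports) / len(reports)
    return 2 * mean - 0.5

# Usage: 100,000 respondents, 30% of whom truly hold the sensitive attribute.
truth = [1 if random.random() < 0.3 else 0 for _ in range(100_000)]
reports = [randomized_response(bit) for bit in truth]
print(round(estimate_fraction(reports), 3))   # close to 0.3
```

Any single reported bit could have come from either true value, which is the source of the privacy; the accuracy of the aggregate comes from averaging over many respondents.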
Adding random noise to the output
has promise, and we will return to it
later; here we point out that if done
naïvely this approach will fail. To see
this, suppose the noise has mean zero
and that fresh randomness is used in
generating every response. In this case,
if the same query is asked repeatedly,
then the responses can be averaged,
and the true answer will eventually
emerge. This is disastrous: an adver-
sarial analyst could exploit this to carry
out the difference attack described
above. The approach cannot be “fixed”
by recording each query and providing
the same response each time a query
is re-issued. There are several reasons
for this. For example, syntactically dif-
ferent queries may be semantically
equivalent, and if the query language is
sufficiently rich, then the equivalence
problem itself is undecidable, so the
curator cannot even test for this.
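The averaging failure is easy to see in a short simulation (a sketch under the assumption of independent zero-mean Gaussian noise; the database, predicate, and noise scale below are chosen only for illustration): because every response uses fresh noise, the running average of repeated responses converges to the exact count.

```python
import random

def noisy_count(database, predicate, sigma=10.0):
    """Naive mechanism: the true count plus fresh zero-mean Gaussian noise."""
    true_count = sum(1 for row in database if predicate(row))
    return true_count + random.gauss(0.0, sigma)

db = [random.randint(0, 1) for _ in range(1000)]
is_one = lambda row: row == 1

# Fresh noise on every response, so averaging repeated responses
# washes the noise out and the true answer emerges.
responses = [noisy_count(db, is_one) for _ in range(10_000)]
print(sum(db), round(sum(responses) / len(responses), 2))
```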
Theorem 1. Let M be a mechanism
that adds noise bounded by E. Then there
exists an adversary that can reconstruct
the database to within 4E positions. 5
Blatant non-privacy with E = n/401 follows immediately from the theorem,
as the reconstruction will be accurate
in all but at most 4E = 4n/401 < n/100
positions.
Proof. Let d be the true database. The
adversary can attack in two phases:
1. Estimate the number of 1's in all
possible sets: Query M on all
subsets S ⊆ [n].
2. Rule out "distant" databases:
For every candidate database
c ∈ {0, 1}^n, if there is any S ⊆ [n]
for which |M(S) - Σ_{i∈S} ci| > E, then rule
out c. If c is not ruled out, then
output c and halt.
Since M(S) never errs by more than E,
the real database will not be ruled out, so
this simple (but inefficient!) algorithm
will output some database; let us call
it c. We will argue that the number of
positions in which c and d differ is at
most 4E.
Let I0 be the indices in which di = 0,
that is, I0 = {i | di = 0}. Similarly, define I1 = {i | di = 1}. Since c was not ruled out, |M(I0) - Σ_{i∈I0} ci| ≤ E; on the other hand, the true answer on I0 is Σ_{i∈I0} di = 0, so |M(I0)| ≤ E. By the triangle inequality, Σ_{i∈I0} ci ≤ 2E, so c and d disagree in at most 2E positions of I0. The same argument applied to I1 shows they disagree in at most 2E positions of I1, for a total of at most 4E positions.
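The attack in the proof can be run verbatim on a toy instance. The sketch below (with n, E, the bounded-noise mechanism, and all names chosen for illustration) queries the mechanism on every subset, outputs the first candidate database that is never ruled out, and prints the Hamming distance to the true database, which the theorem bounds by 4E.

```python
import itertools
import random

def reconstruct(M, n, E):
    """The attack from the proof, for small n (it enumerates all 2^n subsets
    and all 2^n candidate databases, so it is exponential time)."""
    subsets = [S for r in range(n + 1) for S in itertools.combinations(range(n), r)]
    answers = {S: M(S) for S in subsets}              # phase 1: query every subset
    for c in itertools.product((0, 1), repeat=n):     # phase 2: rule out candidates
        if all(abs(answers[S] - sum(c[i] for i in S)) <= E for S in subsets):
            return c                                  # first candidate not ruled out

n, E = 8, 1
d = [random.randint(0, 1) for _ in range(n)]                   # true database
M = lambda S: sum(d[i] for i in S) + random.uniform(-E, E)     # noise bounded by E
c = reconstruct(M, n, E)
print(sum(ci != di for ci, di in zip(c, d)))                   # at most 4*E positions
```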
A natural benchmark for tolerable noise is the sampling error: the number
of 1's in a random sample of n rows from a large population follows a
hypergeometric distribution. That
is, the sampling error is on the order
of √n. We would like that the noise
introduced for privacy is smaller than
the sampling error, ideally o(√n).
Unfortunately, noise of magnitude
o(√n) is blatantly non-private against
a series of n log² n randomly generated
queries, 5 no matter the distribution on
the noise. Several strengthenings of
this pioneering result are now known.
For example, if the entries in S are
chosen independently according to
a standard normal distribution, then
blatant non-privacy continues to hold
even against an adversary asking only
Θ(n) questions, and even if more than
a fifth of the responses have arbitrarily