values range from 0 to 100. The global
sensitivity of qsum is 100. The policy-specific sensitivity of qsum under GL1, 5
is only 5. If, instead, the policy used a
partition graph GP that partitions age
into ranges (such as {0 – 10, 11 – 20,
21 – 30,…, 91 – 100}), then the policy-specific global sensitivity is only 10.
Finally, with the attribute policy, S(qsum,
Gattr, 1) = max
i
(ai – bi).
K-means clustering. For a specific
data-mining result, consider an application of Blowfish to k-means clustering.
Definition 7 (K-means clustering).
Given a set of n vectors {x1, …, xn}, the
k-means clustering problem is to di-
vide these n records among k clusters S
= {S1, …, Sk}, where k ≤ n, so as to mini-
mize the objective function
( 1)
where is the mean of
cluster Si.
The iterative (non-private) k-means
clustering algorithm initializes a set
of k centroids {μ1, μ2,…, μk}, one for
each cluster. These centroids are iteratively updated in two steps: assign
each xj to the cluster with the nearest
centroid, and set each centroid μi to
be the mean of the vectors of its corresponding cluster. The algorithm
terminates after a certain number of
iterations or when the centroids do
not change significantly.
Each iteration (the two steps) are
easily modified to satisfy ε-differential
privacy4, 30 and Blowfish. 19 These steps
require access to the answers to two
queries: qhist, which returns the number of points in each cluster, and qsum,
or the sum of the points in each cluster. As discussed earlier, qsum can be
answered through the Laplace mechanism. Analogously, qhist can be answered with the Laplace mechanism
because it has global sensitivity S(qhist)
= 1 (for differential privacy) and policy-specific global sensitivity S(f, G) = 2 for
all Blowfish policies discussed here.
The policy-specific sensitivity of the
qsum query under Blowfish policies is
typically much smaller than its global
sensitivity so we would thus expect
more accurate clustering under the
Blowfish privacy definitions.
Figure 2 confirms this improve-
ment in utility. For the clustering task,
ord locations that are within 10 miles
of each other. In general, if an indi-
vidual’s location x (in dataset D1) was
changed to another point y (resulting
in a neighboring dataset D2), then an
algorithm satisfying Blowfish with this
policy will have the guarantee that for
all outputs ω
An adversary may thus be able to detect the general geographic region of
a target individual but unable to infer
the location with a resolution better
than 10 miles. Such a relaxed notion
of privacy is reasonable when dealing
with location data; individuals may
not want disclosure of their precise locations but be less worried about disclosing their information at a coarser
granularity (that may be obtained from
other sources). As we show later, data
output by mechanisms that satisfy
such relaxed notions of privacy permit
data mining results with greater accuracy than if data is generated using
mechanisms that satisfy the stricter
notion of differential privacy.
Attribute. Let T be a multi-attribute
domain with m attributes T = A1 × A2 ×
…, × Am. Consider a graph Gattr,c connect-
ing any two values x and y that differ in
at most c attribute values. A Blowfish
policy with this graph is useful for lo-
cation traces and genome data. For
the former, attributes correspond to
locations of an individual at different
times. Neighboring datasets thus dif-
fer in at most c locations of a person,
hiding the specific details about every
sequence of c consecutive locations
of an individual. In the genome case,
an attribute corresponds to a specific
position on the genome. Under this
policy, an algorithm’s output would be
insensitive to changes to a block of up
to c positions on the genome.
Answering queries with Blowfish.
Recall that adding Laplace noise with
0 mean and standard devia-
tion to a function f (where S(f) is the sen-
sitivity of f) ensures ε-differential privacy.
Blowfish, with a policy P = (T, G, T n) is
also compatible with additive Laplace
noise and requires an often smaller
standard deviation of
where S(f, G) is the policy-specific
global sensitivity of f—the largest dif-
ference over all data-
sets D1, D2 that are G-neighbors.
Consider a multidimensional rec-ord domain T = A1 × A2 × …, × Am where
each attribute is numeric. Let qsum denote the function that sums all the
records together; that is, for each attribute, it computes the sum of the values that appear in the data. Let ai and
bi denote the maximum and minimum
values in attribute Ai. The global sensitivity S(qsum) of qsum is .
The policy-specific global sensitivity of
qsum under Blowfish policies is usually
much smaller. In the case of the distance threshold policy Gd,θ with d being
the L1 Manhattan distance, S(qsum, Gd,θ)
is only θ. Consider a single attribute domain Age and further suppose the age
Figure 2. K-means under several Blowfish policies.
1
2
4
8
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
Erro
r
Epsilon (privacy parameter)
laplace
attribute
blowfish|256
blowfish|128
blowfish| 64
blowfish| 32