privacy definitions that can compose
with prior deterministic data releases,
thus avoiding some of the difficulties
discussed earlier in the section on desiderata. To simplify the discussion,
we set Q to be the single constraint
that the dataset has n records, in which
case I_Q = T^n; for more complicated
constraints, see He et al.19 on the Blowfish and
Kifer and Machanavajjhala25 on the Pufferfish frameworks.
The final component of the policy is
G = (T, E), or the “discriminative secret
graph.” The vertices in G are the possible values a record can take. Every
edge (x, y) ∈ E describes a privacy goal
with both counterfactual and simulatability interpretations. From the simulatability viewpoint, changing a single
record from x to y (or vice versa) will
not cause a significant change in the
probability of any output. From the
counterfactual viewpoint, if records
are independent, an attacker could estimate the odds of a new record having
value x vs. y, but estimated odds about
any individual in the data would not
differ significantly from this value. Using this graph G, we define the concept
of neighboring databases, then formally
define the Blowfish framework:
Definition 5 (G-Neighbors). Let P =
(T, G, T^n) be a policy with discriminative
secret graph G = (T, E). Two datasets D1, D2 ∈ T^n are
called G-neighbors if, for some edge (x,
y) ∈ E and some dataset D ∈ T^(n-1), D1 = D
∪ {x} and D2 = D ∪ {y}.
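Definition 5 can be checked mechanically. The following sketch (hypothetical helper name; datasets represented as Python lists treated as multisets, edges as a set of pairs) tests whether two datasets are G-neighbors:

```python
from collections import Counter

def are_g_neighbors(d1, d2, edges):
    """Check whether two datasets (multisets of record values) are
    G-neighbors: they share n-1 records, and the one remaining pair
    (x, y) is an edge of the discriminative secret graph."""
    c1, c2 = Counter(d1), Counter(d2)
    only_1 = list((c1 - c2).elements())  # records unique to d1
    only_2 = list((c2 - c1).elements())  # records unique to d2
    if len(only_1) != 1 or len(only_2) != 1:
        return False                     # must differ in exactly one record
    x, y = only_1[0], only_2[0]
    return (x, y) in edges or (y, x) in edges  # edges are undirected
```

For instance, with the single edge ("flu", "measles"), the datasets ["flu", "healthy"] and ["measles", "healthy"] are G-neighbors, whereas ["flu", "healthy"] and ["healthy", "healthy"] are not.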
Definition 6 ((ε, P)-Blowfish Privacy).
Let P = (T, G, T^n) be a policy. An algorithm M satisfies (ε, P)-Blowfish privacy if for all outputs ω of the algorithm
M and all G-neighbors D1, D2 we have
Pr[M(D1) = ω] ≤ e^ε Pr[M(D2) = ω].
This privacy definition clearly
matches the generic template of Definition 4. We now examine some policies and their applications.
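Before turning to the policies, here is a minimal sketch (assumed function names, not from the article) of a mechanism satisfying Definition 6 for a sum query f(D) = Σ_r q(r). Moving one record along an edge (x, y) changes f by at most |q(x) − q(y)|, so Laplace noise calibrated to the largest such change over all edges suffices:

```python
import math
import random

def policy_sensitivity(q, edges):
    """Largest change in the sum of a per-record score q when one
    record moves along any edge of the discriminative secret graph."""
    return max(abs(q(x) - q(y)) for x, y in edges)

def blowfish_sum(dataset, q, edges, epsilon):
    """Release sum(q(r) for r in dataset) with Laplace noise of scale
    policy_sensitivity / epsilon, satisfying (epsilon, P)-Blowfish
    privacy for the policy whose graph has the given edges."""
    scale = policy_sensitivity(q, edges) / epsilon
    u = random.random() - 0.5          # inverse-CDF Laplace sample
    noise = -scale * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)
    return sum(q(r) for r in dataset) + noise
```

Under the full-domain policy, where the graph connects every pair of values, policy_sensitivity reduces to the usual global sensitivity of differential privacy; under a sparser graph it can be much smaller, which is the point of restricting the secrets.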
Full domain. Consider a policy PK = (T,
K, T^n), where K is the complete graph in which
every pair of values in the domain T is
connected. The result is that two datasets are neighbors if they differ (arbitrarily) in any one record. (ε, PK)-Blowfish
privacy is equivalent to a popular variant of differential privacy11 that requires
Pr[M(D1) = ω] ≤ e^ε Pr[M(D2) = ω] for all ω and for all pairs of datasets D1, D2
that differ (arbitrarily) in the value (rather than the presence/absence) of one record.
Partitioned. Let us partition the domain T into p mutually exclusive subsets, with P = {P1, …, Pp}. Consider a
graph GP = (T, E), where any two values
x, y are connected by an edge if and
only if x and y appear in the same partition. Each connected component of
GP is thus a clique corresponding to
one of the Pi. Now, two datasets D1 and
D2 are neighbors if D2 can be obtained
from D1 by replacing the value of one
record with a new value belonging to
the same partition. For example, let
T be the set of all disease outcomes,
partitioned into three subsets: healthy
cases, communicable diseases, and
non-communicable diseases. Let us
use the graph GP corresponding to
this partition in our Blowfish policy.
An algorithm M satisfying Definition 6 comes with the guarantee that
the probabilities of its outputs do not
change substantially if one communicable disease is replaced with another
communicable disease, or a healthy
case with another healthy case (a
simulatability interpretation).
What about replacing a noncommunicable disease with a communicable disease? Can the algorithm’s
output probabilities be significantly
different in such a case? The answer
is yes. In fact, this policy allows algorithms to publish the exact status of
each individual—healthy, contagious,
or noncontagious—and approximate
histograms of each disease. However,
specific details (such as which person
has which contagious disease) are protected. Such behavior may be desirable
in certain health-care applications
where some facts must be disclosed
but further details kept confidential.
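This behavior can be sketched directly (a toy illustration with assumed names, not code from the article). Partition totals are identical across neighbors, since a neighbor swap stays within one partition, so they can be published exactly; the per-value histogram changes in at most two cells per swap (one count down, one up), so Laplace noise with L1 sensitivity 2 protects it:

```python
import math
import random
from collections import Counter

def laplace(scale):
    """Inverse-CDF sample from the Laplace(0, scale) distribution."""
    u = random.random() - 0.5
    return -scale * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)

def partitioned_release(dataset, partition_of, epsilon):
    """Exact per-partition totals plus a noisy per-value histogram.
    G-neighbors under the partitioned policy swap one value *within*
    a partition, so the totals never change (sensitivity 0), while
    the value histogram changes in two cells (L1 sensitivity 2)."""
    totals = Counter(partition_of[r] for r in dataset)  # exact release
    hist = Counter(dataset)
    noisy = {v: c + laplace(2.0 / epsilon) for v, c in hist.items()}
    return dict(totals), noisy
```

With the disease example above, the communicable/noncommunicable/healthy totals come out exact, while each individual disease count is perturbed.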
Distance threshold. Many applications involve a concept of distance
between records; for instance, the distance between two age values can be
the absolute difference, and the distance between two points on a plane
can be the straight-line Euclidean
distance or the Manhattan distance
along a grid. Given a distance metric d,
one can define a discriminative secret
graph Gd,θ in which only nearby points
are connected. That is, (x, y) ∈ E only
when d(x, y) < θ for some threshold θ;
for example, if T is the set of all points
on Earth, and d is the orthodromic distance between pairs of points, we can
set θ = 10 miles, so valid record locations are connected to other valid rec-