As we will see next, there are deep
reasons for the fact that auxiliary information plays such a prominent role in
these examples.
Dalenius’s Desideratum
In 1977 the statistician Tore Dalenius
articulated an “ad omnia” (as opposed
to ad hoc) privacy goal for statistical
databases: Anything that can be learned
about a respondent from the statistical
database should be learnable without
access to the database. Although informal, this feels like the “right” direction. The breadth of the goal captures
all the common intuitions for privacy.
In addition, the definition only holds
the database accountable for whatever
“extra” is learned about an individual,
beyond that which can be learned from
other sources. In particular, an extrovert who posts personal information
on the Web may destroy his or her own
privacy, and the database should not
be held accountable.
Formalized, Dalenius’ goal is strikingly similar to the gold standard for
security of a cryptosystem against a
passive eavesdropper, defined five
years later. Semantic security captures
the intuition that the encryption of
a message reveals no information
about the message. This is formalized
by comparing the ability of a computationally efficient adversary, having
access to both the ciphertext and any
auxiliary information, to output (
anything about) the plaintext, to the ability of a computationally efficient party
having access only to the auxiliary
information (and not the ciphertext),
to achieve the same goal. 13 Abilities are
measured by probabilities of success,
where the probability space is over the
random choices made in choosing the
encryption keys, the ciphertexts, and
by the adversaries. Clearly, if this difference is very, very tiny, then in a rigorous sense the ciphertext leaks (almost)
no information about the plaintext.
The formal definition of semantic
security is a pillar of modern cryptography. It is therefore natural to ask
whether a similar property, such as
Dalenius’ goal, can be achieved for statistical databases. But there is an essential difference in the two problems.
Unlike the eavesdropper on a conversation, the statistical database attacker
is also a user, that is, a legitimate
the formal
definition of
semantic security
is a pillar of modern
cryptography.
it is therefore
natural to ask
whether a similar
property, such as
Dalenius’ goal,
can be achieved
for statistical
databases.
consumer of the information provided
by the statistical database (not to mention the fact that he or she may also be
a respondent in the database).
Many papers in the literature
attempt to formalize Dalenius’ goal (in
some cases unknowingly) by requiring
that the adversary’s prior and posterior beliefs about an individual (that
is, before and after having access to
the statistical database) should not be
“too different,” or that access to the
statistical database should not change
the adversary’s views about any individual “too much.” The difficulty with
this approach is that if the statistical
database teaches us anything at all,
then it should change our beliefs about
individuals. For example, suppose the
adversary’s (incorrect) prior view is
that everyone has two left feet. Access
to the statistical database teaches that
almost everyone has one left foot and
one right foot. The adversary now has
a very different view of whether or not
any given respondent has two left feet.
But has privacy been compromised?
The last hopes for Dalenius’ goal
evaporate in light of the following parable, which again involves auxiliary
information. Suppose we have a statistical database that teaches average
heights of population subgroups, and
suppose further that it is infeasible
to learn this information (perhaps
for financial reasons) in any other
way (say, by conducting a new study).
Finally, suppose that one’s true height
is considered sensitive. Given the auxiliary information “Turing is two inches
taller than the average Lithuanian
woman,” access to the statistical database teaches Turing’s height. In contrast, anyone without access to the
database, knowing only the auxiliary
information, learns much less about
Turing’s height.
A rigorous impossibility result generalizes this argument, extending to
essentially any notion of privacy compromise, assuming the statistical database is useful. The heart of the attack
uses extracted randomness from the
statistical database as a one-time pad
for conveying the privacy compromise
to the adversary/user. 6, 9
Turing did not have to be a member
of the database for the attack described
earlier to be prosecuted against
him. More generally, the things that