contributed articles
DOI: 10.1145/2660766

Designing Statistical Privacy for Your Data

Preparing data for public release requires significant attention to fundamental principles of privacy.

BY ASHWIN MACHANAVAJJHALA AND DANIEL KIFER

In 2006, AOL released a file containing search queries posed by many of its users. The user names were replaced with random hashes, though the query text was not modified. It turned out that some users had queried their own names, or "vanity queries," and nearby locations, such as local businesses. As a result, it was not difficult for reporters to find and interview an AOL user1 and then learn personal details about her (such as age and medical history) from the rest of her queries.

Could AOL have protected all its users by also replacing each word in the search queries with a random hash? Probably not; Kumar et al.27 showed that word co-occurrence patterns would provide clues about which hashes correspond to which words, thus allowing an attacker to partially reconstruct the original queries.
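To give a feel for how such a reconstruction could work, here is a toy sketch, not Kumar et al.'s actual technique: it assumes a hypothetical attacker who holds an auxiliary plain-text query corpus whose word statistics resemble the hashed log (here, for simplicity, the very same handful of made-up queries) and who matches each hash to the word whose frequency and co-occurrence profile it most resembles.

```python
import hashlib
from collections import Counter, defaultdict
from math import sqrt

# Toy query log; in the AOL scenario these would be millions of real queries.
queries = [
    "cheap flights to boston",
    "boston weather forecast",
    "weather forecast for new york",
    "cheap hotels in new york",
    "flights from boston to new york",
    "new york pizza near me",
    "pizza delivery boston",
    "cheap pizza near me",
]

def hash_token(tok: str) -> str:
    """Stand-in for tokenwise hashing of query words."""
    return hashlib.sha1(tok.encode()).hexdigest()[:8]

hashed_queries = [[hash_token(t) for t in q.split()] for q in queries]

def profiles(corpus):
    """Frequency of each token and its co-occurrence counts with other tokens."""
    freq = Counter(t for q in corpus for t in q)
    cooc = defaultdict(Counter)
    for q in corpus:
        for t in q:
            for u in q:
                if u != t:
                    cooc[t][u] += 1
    return freq, cooc

# The attacker sees the hashed log and an auxiliary corpus of plain-text
# queries with similar statistics (here, for illustration, the same queries).
h_freq, h_cooc = profiles(hashed_queries)
w_freq, w_cooc = profiles([q.split() for q in queries])

# Describe every token by its co-occurrence counts against the other tokens in
# frequency-rank order; because the toy auxiliary corpus matches the hashed
# log exactly, the two rank orders line up coordinate by coordinate.
h_rank = [t for t, _ in h_freq.most_common()]
w_rank = [t for t, _ in w_freq.most_common()]

def vector(tok, rank, cooc):
    return [cooc[tok][other] for other in rank]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Match each hash to the plain word with the most similar profile.
correct = 0
for h in h_rank:
    hv = vector(h, h_rank, h_cooc)
    guess = max(w_rank, key=lambda w: cosine(hv, vector(w, w_rank, w_cooc)))
    correct += (hash_token(guess) == h)
print(f"recovered {correct} of {len(h_rank)} hashed words")
```

On this toy input every hashed word is recovered; real query logs are far noisier, but enough co-occurrence structure survives to allow the partial reconstruction that Kumar et al.27 demonstrated.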
Such privacy concerns are not unique to Web-search data. Businesses, government agencies, and research groups routinely collect data about individuals and need to release some form of it for a variety of reasons (such as meeting legal requirements, satisfying business obligations, and encouraging reproducible scientific research). However, they must also protect the sensitive information in the raw data, including identities, facts about individuals, trade secrets, and other application-specific considerations. The privacy challenge is that sensitive information can be inferred in many ways from data releases. Homer et al.20 showed participants in genomic research studies may be identified from the publication of aggregated research results. Greveler et al.17 showed smart-meter readings can be used to identify the TV shows and movies being watched in a target household. Coull et al.6 showed webpages viewed by users can be deduced from metadata about network flows, even when server IP addresses are replaced with pseudonyms. And Goljan and Fridrich16 showed how cameras can be identified from noise in the images they produce.
Naive aggregation and perturbation of the raw data often leave exposed channels for making inferences about sensitive information;6,20,32,35 for instance, simply perturbing energy readings from a smart meter independently does not hide trends in energy use.
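As a minimal illustration of that point, the following sketch (with an arbitrarily chosen synthetic load profile and noise scale) adds independent Gaussian noise to per-minute smart-meter readings; a simple moving average cancels the noise and the household's evening usage spike remains plainly visible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical household load: a flat baseline with an evening spike
# (e.g., a TV and lights switching on), sampled every minute for a day.
minutes = np.arange(24 * 60)
true_load = 0.2 + 0.8 * ((minutes >= 19 * 60) & (minutes < 23 * 60))  # kW

# Naive "privacy": perturb each reading independently with Gaussian noise.
noisy_load = true_load + rng.normal(scale=0.5, size=true_load.shape)

# A snooper smooths the noisy series with a 30-minute moving average;
# the independent noise averages out and the evening spike reappears.
window = 30
smoothed = np.convolve(noisy_load, np.ones(window) / window, mode="same")

evening = smoothed[(minutes >= 19 * 60) & (minutes < 23 * 60)].mean()
daytime = smoothed[(minutes >= 9 * 60) & (minutes < 17 * 60)].mean()
print(f"estimated evening load: {evening:.2f} kW, daytime load: {daytime:.2f} kW")
# The gap between the two estimates reveals the usage pattern despite the noise.
```

Independent per-reading noise hides individual readings but not the aggregate pattern, which is exactly the kind of inference channel a carefully chosen privacy definition must account for.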
"Privacy mechanisms," or algorithms
that transform the data to ensure privacy, must be designed carefully according to guidelines set by a privacy
definition. If a privacy definition is
chosen wisely by the data curator, the
sensitive information will be protected.
key insights
• Data snoopers are highly motivated to publicize or take advantage of private information they can deduce from public data.
• History shows simple data anonymization and perturbation methods frequently leak sensitive information.
• Focusing on privacy design principles can help mitigate this risk.