OPEN A NEWSPAPER or a Web browser and you’re certain to encounter a spate of stories about the misuse or loss of data and how it puts personal information at risk. Over the last decade, as computers and databases have grown ever more sophisticated, privacy concerns have moved to center stage. Today, government agencies worry about keeping highly sensitive financial and health data private. Corporations fret over protecting customer records. And the public grows ever more wary—and distrustful—of organizations that handle sensitive data.
“Privacy issues aren’t about to go away,” observes Adam Smith, an assistant professor in the computer science and engineering department at the Pennsylvania State University. “One problem we face is that ‘privacy’ is an overloaded term. It means different things to different people and a lot of issues hinge on context. As a result, it is extremely difficult to create effective solutions and protections—and to gain the trust that is necessary for respondents to answer sensitive questions honestly.”
Some 220 million private records have been lost or stolen in the United States since January 2005, according to the Privacy Rights Clearinghouse, a San Diego, CA-based organization that tracks privacy issues. While no worldwide statistics exist, it’s entirely apparent that a tangle of regulations, laws, and best practices cannot solve the problem. Worse, increasingly sophisticated tools make it possible to piece information together and glean details and facts about people in a way that wasn’t imaginable a few years ago.
Now, a handful of researchers, mathematicians, and computer scientists are hoping to alter the landscape and frame the debate in new and important ways. Introducing a concept that has been dubbed “differential privacy,” these data experts are seeking to use mathematical equations and algorithms to standardize the way computers—and organizations—protect personal data while revealing overall statistical trends. The goal, says Cynthia Dwork, a principal researcher at Microsoft, is to ensure that an adversary cannot compromise data when he or she combines the released statistics with other external sources of information. “It’s an extremely attractive approach,” she says.
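To make the idea of releasing statistical trends while protecting individuals more concrete, here is a minimal sketch of one widely used construction, the Laplace mechanism. The article does not name a specific algorithm, so the choice of mechanism, the function name noisy_count, the toy records, and the epsilon value are all assumptions made for illustration. The sketch answers a counting query and adds random noise scaled to 1/epsilon, so that any one person's presence or absence changes the answer's distribution only slightly.

```python
import random


def noisy_count(records, predicate, epsilon, rng):
    """Return a count of matching records plus Laplace noise.

    Adding or removing one person changes a count by at most 1, so the
    noise scale is 1 / epsilon. Smaller epsilon means more noise and
    stronger privacy. (Illustrative helper, not from the article.)
    """
    true_count = sum(1 for r in records if predicate(r))
    scale = 1.0 / epsilon
    # The difference of two independent exponential samples with mean
    # `scale` follows a Laplace distribution centered at 0.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise


# Hypothetical toy data: 100 patients, 30 of whom have a given condition.
patients = [{"id": i, "has_condition": i < 30} for i in range(100)]

rng = random.Random(0)  # seeded only so the sketch is repeatable
released = noisy_count(patients, lambda r: r["has_condition"], 0.5, rng)
print(round(released, 2))
```

Each released value is off by a few units, but averaged over the whole population the trend (roughly 30 of 100) remains visible, which is the trade-off the researchers describe.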
The ability to collect and analyze vast data sets offers substantial promise. Sifting through medical data, genotype and phenotype connections, epidemiological statistics, and their correlation with events such as chemical spills or dietary and exercise patterns can help dictate public policy and find preventive strategies and cures for real people with real afflictions.
Yet, protecting privacy is an increasingly tricky proposition and one that confounds a growing number of organizations. Beyond the widely publicized hacker attacks and security lapses, there’s an escalating threat of a person or organization assembling enough pieces of seemingly benign data—sometimes from different sources—to create a useful snapshot of a person or group. Kobbi Nissim, an assistant professor of computer science at Ben-Gurion University, describes this approach as “connecting the dots.” Oftentimes, it involves culling seemingly unrelated data from diverse and disparate sources.
It’s not an abstract concept. When online movie rental firm Netflix decided to improve its recommendation system in 2007, executives emphasized that they would provide complete customer anonymity to participants. Netflix designed a system that retained the date of each movie rating along with the title and year of its release. And it assigned randomized numbers in place of customer IDs.
This seemed like a perfect system until a pair of researchers—graduate student Arvind Narayanan and professor Vitaly Shmatikov, both from the department of computer sciences at the University of Texas at Austin—proved that it was possible to identify individuals among a half-million participants by using public reviews published in the Internet Movie Database (IMDb) to identify movie ratings within Netflix’s data. In fact, eight ratings along with dates were enough to provide 99% accuracy, according to the researchers.
This type of privacy violation, known as a linkage attack (in which attackers use innocuous data in one data set to identify a record in a second data set that contains both innocuous and sensitive data), has serious repercussions, Dwork says. It could identify someone who is gay or has an interest in extremely violent or pornographic films. Such information might potentially interfere with a person’s employment or affect his or her ability to rent an apartment or belong to a religious organization. “It could result in public humiliation,” says Dwork, who notes that “the conclusion may be wrong. Partners share accounts. People buy gifts, and they may have some other reason for renting or buying certain movies.”
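A linkage attack of this kind can be illustrated with a toy example. The sketch below is a simplified stand-in, not the researchers' actual method: it assumes exact matches on title and rating and a small tolerance on dates, and every record, name, ID, and the link function itself are hypothetical.

```python
from datetime import date

# Hypothetical "anonymized" release: random IDs replace customer names,
# but titles, ratings, and rating dates are retained.
anonymized_ratings = [
    {"user": 4821, "title": "Film A", "rating": 5, "date": date(2005, 3, 1)},
    {"user": 4821, "title": "Film B", "rating": 2, "date": date(2005, 3, 10)},
    {"user": 4821, "title": "Film C", "rating": 4, "date": date(2005, 4, 2)},
    {"user": 4821, "title": "Film D", "rating": 1, "date": date(2005, 4, 20)},
    {"user": 1093, "title": "Film A", "rating": 3, "date": date(2005, 3, 1)},
    {"user": 1093, "title": "Film B", "rating": 2, "date": date(2005, 6, 10)},
    {"user": 1093, "title": "Film E", "rating": 4, "date": date(2005, 5, 5)},
]

# Hypothetical public reviews posted under a real name.
public_reviews = [
    {"name": "Alice", "title": "Film A", "rating": 5, "date": date(2005, 3, 2)},
    {"name": "Alice", "title": "Film B", "rating": 2, "date": date(2005, 3, 11)},
    {"name": "Alice", "title": "Film C", "rating": 4, "date": date(2005, 4, 2)},
]


def link(anonymized, public, max_days=3):
    """Map each public name to the anonymized ID that shares the most
    ratings with it (same title, same rating, dates within max_days)."""
    scores = {}
    for p in public:
        for a in anonymized:
            same_item = p["title"] == a["title"] and p["rating"] == a["rating"]
            close_in_time = abs((p["date"] - a["date"]).days) <= max_days
            if same_item and close_in_time:
                key = (p["name"], a["user"])
                scores[key] = scores.get(key, 0) + 1
    best = {}
    for (name, user), count in scores.items():
        if name not in best or count > best[name][1]:
            best[name] = (user, count)
    return best


matches = link(anonymized_ratings, public_reviews)
print(matches)
```

Once a name is tied to an anonymized ID, every other rating under that ID, including ones the person never posted publicly (Film D in this toy data), is exposed along with it.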
It’s not the first time such an event has taken place. In 2006, researchers