and the edges by “superedges,” which report the density of edges in the
original graph. This process ensures that the probability of identifying
a node in the published data, which is related to the size of the supernode
to which it belongs, remains limited. Furthermore, it attempts to preserve
utility by finding the partitioning that best retains the structural
characteristics of the original graph, using a fitting model based on a
maximum-likelihood approach.
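To make the published summary concrete, the sketch below (a minimal illustration with a hypothetical graph and partition, not the authors' implementation) computes the edge densities that the superedges would report for a given partition of the nodes into supernodes.

```python
# Minimal sketch: given a partition of the nodes into supernodes, compute the
# density of edges that each superedge would report in the published summary.
# The graph representation and example data below are hypothetical.
def superedge_densities(edges, partition):
    """edges: iterable of (u, v) pairs of an undirected, simple graph.
       partition: list of sets of nodes, each set being one supernode."""
    block_of = {v: i for i, block in enumerate(partition) for v in block}
    counts = {}
    for u, v in edges:
        key = tuple(sorted((block_of[u], block_of[v])))
        counts[key] = counts.get(key, 0) + 1
    densities = {}
    for (i, j), c in counts.items():
        if i == j:                      # edges inside one supernode
            n = len(partition[i])
            possible = n * (n - 1) // 2
        else:                           # edges between two supernodes
            possible = len(partition[i]) * len(partition[j])
        densities[(i, j)] = c / possible if possible else 0.0
    return densities

# Two supernodes of size two: the summary reveals only that the internal
# superedge has density 1.0 and the cross superedge density 0.5, not which
# particular nodes are connected.
print(superedge_densities([("a", "b"), ("a", "c"), ("b", "d")],
                          [{"a", "b"}, {"c", "d"}]))
```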
Finally, Zhou and Pei [18] have developed an algorithm that combines
random perturbation and data generalization to prevent identity
disclosure. Unlike the previous two types of methods, their
algorithm takes into account the labels of nodes (which contain user
attributes), and it performs anonymization by generalizing these attributes
and by inserting edges. However, it considers a limited class of
attackers, who know the subgraph induced by a node and its
immediate neighbors, and it quantifies information loss using a heuristic
measure based on the number of edges and nodes added to the
anonymized neighborhoods and on the way labels are generalized.
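To make this attacker model concrete, the sketch below (a hedged illustration over a hypothetical adjacency-list representation, not Zhou and Pei's code) extracts the knowledge such an attacker is assumed to hold: the subgraph induced by a target node and its immediate neighbors.

```python
# Sketch: the attacker's background knowledge is the 1-neighborhood of a
# target node, i.e., the subgraph induced by the node and its neighbors.
# Node names and the adjacency lists are hypothetical.
def one_neighborhood(adj, v):
    """adj: dict mapping each node to the set of its neighbors."""
    nodes = {v} | adj[v]
    induced = {(a, b) for a in nodes for b in adj[a] & nodes if a < b}
    return nodes, induced

adj = {"Ann": {"Bob", "Carl"}, "Bob": {"Ann", "Carl"},
       "Carl": {"Ann", "Bob", "Dan"}, "Dan": {"Carl"}}
print(one_neighborhood(adj, "Ann"))
# Anonymization must ensure that several nodes share an indistinguishable
# (generalized) neighborhood, so this signature cannot single out one user.
```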
Link Disclosure
Perturbation of social network data has also been applied to thwart
link disclosure attacks by Ying and Wu [16]. The authors considered
how a graph can be published in a form that conceals the sensitive
links, while preserving the spectrum of the original graph, which
reflects topological properties such as the diameter, long paths, and the
cluster structure. Two methods based on edge modification were
proposed. The first repeatedly adds a random edge and subsequently
deletes another, so that the total number of edges is preserved.
The second swaps pairs of edges in a way that leaves the degree
distribution of the original graph unaffected.
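The sketch below illustrates both strategies on a simple undirected graph (an assumed set-of-edges representation; the parameter names and stopping conditions are illustrative, not taken from [16]).

```python
# Sketch of the two edge-modification strategies described above; the
# representation (a set of sorted node pairs) and parameters are illustrative.
import random

def rand_add_del(edges, nodes, k):
    """Repeat k times: add one random non-edge, then delete one random edge,
    so the total number of edges is preserved (assumes the graph is not
    complete)."""
    edges = set(edges)
    for _ in range(k):
        while True:
            u, v = random.sample(sorted(nodes), 2)
            e = (min(u, v), max(u, v))
            if e not in edges:
                edges.add(e)
                break
        edges.remove(random.choice(sorted(edges)))
    return edges

def rand_switch(edges, k):
    """Repeat k times: rewire two edges (u,v),(x,y) into (u,y),(x,v), which
    leaves the degree of every node, and hence the degree distribution,
    unchanged."""
    edges = set(edges)
    for _ in range(k):
        (u, v), (x, y) = random.sample(sorted(edges), 2)
        e1, e2 = (min(u, y), max(u, y)), (min(x, v), max(x, v))
        if len({u, v, x, y}) == 4 and e1 not in edges and e2 not in edges:
            edges -= {(u, v), (x, y)}
            edges |= {e1, e2}
    return edges

random.seed(0)
cycle = {("a", "b"), ("b", "c"), ("c", "d"), ("a", "d")}
print(rand_switch(cycle, k=5))   # same degrees, possibly different links
```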
Instead of modifying the graph structure, Leskovec and Faloutsos [9]
proposed generating a graph with a structure different from that of the
original, but with similar topological properties, such as the degree
distribution, diameter, or spectrum. The intuition behind this approach is
that the resulting graph would still protect privacy, while allowing graph
analysis in practice. An efficient, linear-time algorithm based on a graph
fitting model was also developed. The underlying assumption of this
algorithm is that the employed graph fitting model is able to generate
graphs that obey many of the patterns found in real graphs. The
effectiveness of this approach was experimentally verified using
real-world data.
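A well-known example of such a graph fitting model is the stochastic Kronecker model. The sketch below is a hedged illustration: the 2x2 seed probabilities are made up, the fitting step is omitted, and a naive quadratic sampler is used instead of the linear-time procedure discussed above.

```python
# Sketch of a stochastic Kronecker graph generator; the seed values are
# illustrative, and the model-fitting step is omitted entirely.
import itertools
import random

def kronecker_graph(seed, k):
    """seed: 2x2 list of edge probabilities; k: number of Kronecker powers.
       Returns a set of directed edges over 2**k nodes (self-loops skipped)."""
    n = 2 ** k
    edges = set()
    for u, v in itertools.product(range(n), repeat=2):
        p = 1.0
        for level in range(k):
            i = (u >> level) & 1        # bit of u selecting a seed row
            j = (v >> level) & 1        # bit of v selecting a seed column
            p *= seed[i][j]
        if u != v and random.random() < p:
            edges.add((u, v))
    return edges

random.seed(0)
g = kronecker_graph([[0.9, 0.5], [0.5, 0.1]], k=6)   # 64 nodes
print(len(g), "edges sampled")
```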
Figure 4: A 2-anonymization of the social network of Figure 1a is shown. (The asterisk denotes a character that has been removed, and age values are transformed into intervals.)
Content Disclosure
Preventing content disclosure may be achieved by applying perturbation
and anonymization approaches originally developed for relational
data [1], after representing users’ identities and the attributes contained
in their labels as a relational table. An example of preventing content
disclosure through the application of 2-anonymity [15] is shown in
Figure 4. Notice that in this data no user can be identified based on the
set of attributes {Age, Sex, Zip} with a probability of more than 1/2.
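As a minimal sketch of how such a 2-anonymization can be produced (the records and generalization rules below are illustrative, not the data of Figure 4), ages are coarsened into ten-year intervals and the trailing Zip digits are suppressed, so that every {Age, Sex, Zip} combination is shared by at least two users.

```python
# Sketch of 2-anonymity by generalization, in the spirit of Figure 4;
# the records and generalization rules are illustrative.
from collections import Counter

def generalize(record):
    age, sex, zip_code = record
    low = (age // 10) * 10
    return (f"[{low}-{low + 9}]", sex, zip_code[:3] + "**")

records = [(23, "M", "37203"), (27, "M", "37212"),
           (45, "F", "37064"), (41, "F", "37069")]
published = [generalize(r) for r in records]

# 2-anonymity holds if every quasi-identifier combination occurs at least
# twice, so the re-identification probability is at most 1/2.
assert all(count >= 2 for count in Counter(published).values())
print(published)
```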
Staying Connected, Privately
Privacy-preserving publication of social networks is a very promising,
challenging, and rapidly evolving research area. Ensuring privacy in
this context still requires a number of issues to be carefully addressed.
First, although there has been much progress in designing
algorithms to protect privacy in relational data, these algorithms are
generally not applicable in the context of social network data, due to its
significantly more complex structure. This calls for effective and efficient
algorithms that protect social network data from identity and link
disclosure. On the positive side, principles and methodologies originally
developed for relational data, such as k-anonymity and generalization
[15], can provide the basis for designing such algorithms.
Second, since a social network is an environment in which millions
of users interact with one another on a daily basis, both the dynamic
nature of the data and the privacy expectations of users need to be
taken into account to design useful methodologies for publishing
social network data.
Biographies
Grigorios Loukides (grigorios.loukides@vanderbilt.edu) is currently a
postdoctoral research fellow in the Department of Biomedical Informatics at Vanderbilt University. He holds a BSc from the University of
Crete, Greece, and a PhD from Cardiff University, both in computer science. His research interests lie broadly in the fields of privacy and
trust in data management and emerging database applications.
Aris Gkoulalas-Divanis (aris.gkoulalas@vanderbilt.edu) has a BS from
the University of Ioannina, an MS from the University of Minnesota, and
a PhD from the University of Thessaly in computer science. He is currently
a postdoctoral research fellow in the Department of Biomedical Informatics
at Vanderbilt University. He has served as a research assistant at both the
University of Minnesota (2003-2005) and the University of Manchester
(2006). His research interests are in the areas of privacy-preserving data
mining, privacy in medical records, and privacy in location-based services.
References
1. Aggarwal, C. C. and Yu, P. S. 2008. Privacy-preserving data mining:
Models and algorithms. In Advances in Database Systems. Springer.
2. Backstrom, L., Dwork, C., and Kleinberg, J. 2007. Wherefore art
thou r3579x?: Anonymized social networks, hidden patterns, and
structural steganography. In Proceedings of the International World
Wide Web Conference. 181-190.
3. Chew, M., Balfanz, D., and Laurie, B. 2008. (Under)mining privacy in
social networks. W2SP: Web 2.0 Security and Privacy, W2SP website.