In content disclosure, the sensitive data associated with each node is
compromised when social network data is published. This is because
published data may contain users’ personal information, such as
demographics, which have been shown to allow uncovering the identities of users when linked to publicly available data sources [ 15].
Consider, for example, the network data of Figure 1(b). Although, this
data has been de-identified, knowing the age of a particular user
allows an attacker to identify this user in the graph.
lished data have been developed. These algorithms fall into three categories with respect to the data modification techniques they employ.
The first category of algorithms is based on random perturbation.
Algorithms of this category perform local modifications to the structure of the graph representing social network data to achieve privacy.
The main idea behind these algorithms is that attackers will not be
able to use their background knowledge to exclude nodes that do not
match the individual they are trying to re-identify, since all the graphs
that could have been modified using perturbation to produce the
anonymized graph need to be considered instead.
The first algorithm of this kind was proposed by Hay et al. [ 7] and
works in two steps. It first constructs an anonymized graph by deleting m edges of the original graph selected uniformly at random. This
creates an interim graph from which m edges, selected uniformly at
random from the set of non-existent edges in this graph, are subsequently inserted to it.
Consider, for example, an attacker knowing that Anne has two
friends (i.e., a degree of 2) and applying this algorithm on the de-identified version of the graph shown in Figure 2 (left side) using m= 1.
After deleting the edge between Anne and Tom, the algorithm inserts
an edge between Brad and Tom, constructing the graph
of Figure 2 (right side) as a (possible) result. Notice that
the attacker cannot identify the node corresponding to
Anne using the latter graph, even when the attacker has
knowledge of the type of algorithm applied for protection and the parameter m. In other words, this attacker
is not certain about which of the three nodes corresponds to Anne. In fact, the graph illustrated on the right
side of Figure 2 could have been created from either of
the graphs shown in Figure 3. However, this algorithm
does not bound the probability of identifying nodes in
the social network, nor does it minimize data distortion.
Liu and Terzi have also proposed an algorithm based
on random perturbation [ 10]. This algorithm modifies
the graph so as to ensure that there are at least k nodes
having the same degree in the anonymized graph. Thus,
it can prevent identity disclosure when an attacker
knows the degree of some nodes of the graph, because
this attacker will need to distinguish the node corresponding to a target individual
among k nodes sharing the same
degree. In addition, this algorithm
optimizes a simple data utility
function based on the number of
edges added to the graph.
The second type of method is
based on grouping nodes and
edges together in an attempt to
prevent an attacker from distinguishing between nodes belonging
in the same group [ 18]. An algorithm based on this idea was proposed in [ 6]. This algorithm works
by placing the nodes of the original
network data into groups of a controlled size, called “supernodes,”
Methods for Privacy-Preserving Publication
of Social Network Data
In this section, we briefly present some indicative methodologies that have
been proposed to prevent identity, link and content disclosure, thus enabling social network data to be published in a privacy-preserving manner. All these methods are applied on de-identified social network data.
Several algorithms that attempt to prevent identity disclosure by limiting the probability of identifying nodes of the social network in the pub-
Figure 2: An example of a social network and its anonymized counterpart.
The original social network before de-identification is shown on the left, and
the anonymized social network using [ 7] with m = 1 is shown on the right.
Figure 3: Possible graphs that could have been created by applying random perturbation methods
on the social network of Figure 2.